This thesis is concerned with the algorithmic and logic design of
an  arithmetic  unit  to  be  used  in  a  computational  environment  in  which  the 
basic  arithmetic  operations  satisfy  the  on-line  property;  that  is,  to 
generate the j-th digit of a result (where a digit consists of n bits for
base $2^n$), it is necessary and sufficient to have the operands available
only up to the j-th digit plus, in the case of division, a predetermined
number  of  extra  digits  which  correspond  to  an  "on-line  delay."   Since  there 
is  no  on-line  delay  for  addition,  subtraction,  and  multiplication,  the 
unit  can  begin  generating  result  digits  as  soon  as  one  digit  of  each 
operand  has  been  input.   The  delay  for  division  is  shown  to  be  a  small, 
positive,  radix  dependent  constant.   To  fulfill  the  on-line  requirements,  a 
set  of  left-to-right  (most-to-least  significant),  digit-by-digit  algorithms 
have  been  derived.   The  existence  of  such  algorithms  is  contingent  upon 
the  use  of  a  redundant  representation  for  the  result  digits.   These  algorithms 
and  a  block  diagram  level  implementation  of  the  basic  arithmetic  unit  are 
developed  in  the  thesis. 

The  proposed  arithmetic  unit,  capable  of  performing  on-line  operations, 
would  be  extremely  useful  in  many  real-time  applications.   Due  to  its 
potential  for  performing  sequences  of  operations  in  an  overlapped  fashion 
(pipelining), the unit could provide an effective way to speed up execution.
Furthermore,  it  is  ideally  suited  for  variable  precision  arithmetic. 

1.   INTRODUCTION

This  thesis  is  concerned  with  the  development  of  a  set  of  basic 
(addition,  subtraction,  multiplication,  and  division)  arithmetic  algorithms 
suitable  for  use  in  a  computational  environment  which  calls  for  the  on-line 
processing  of  data.   In  on-line  processing  the  operands,  as  well  as  the 
results,  flow  through  the  arithmetic  unit  in  a  digit-by-digit  manner,  most 
significant  digit  first.   In  various  real-time  applications  in  which  the 
operands  are  generated  serially  by  an  analog-to-digital  conversion  process 
beginning  with  the  most  significant  digits,  an  arithmetic  unit  possessing 
this  on-line  property  is  highly  desirable.   A  unit  which  operates  in  an 
on-line  fashion  can  provide  the  ever  more  popular  microprocessor,  a  device 
traditionally  restricted  from  most  mathematical  applications  because  of 
its  short  word  length,  with  variable  precision  arithmetic  capabilities. 
At  the  same  time,  it  can  provide  for  overlapping  the  generation  of  result 
digits  with  the  fetching  of  operand  digits.   As  an  added  bonus,  the  user 
may  halt  the  processing  when  sufficient  precision  has  been  obtained,  which 
may  conceivably  occur  before  all  of  the  operand  digits  have  been  processed. 
With these thoughts in mind, an effort has been made to develop
implementable, on-line algorithms for the basic arithmetic operations.

1.1  Objectives 

A  set  of  implementable  on-line  algorithms  for  the  basic  arithmetic 
functions should meet several desirable objectives.   Three specific
objectives were imposed on the algorithmic design during the development
stage of this dissertation.

OBJECTIVE  1:   The  algorithms  should  be  on-line  with 
respect  to  the  result  digits.   They  should  generate  the 
most  significant  digits  of  the  result  first,  in  such  a 
way  that  once  generated,  the  result  digit  produced  at 
step  j  would  not  be  affected  by  any  subsequent  step  k, 
k  >  j. 

OBJECTIVE  2:   The  algorithms  should  also  be  on-line 
with  respect  to  the  operand  digits.   Only  those  digits 
up  to  and  including  those  provided  at  step  j  should  be 
required in order to perform the j-th step of the
algorithm.   To  avoid  extreme  scaling  of  the  operands 
during  division,  a  limited  number  of  leading  digits  of 
each  operand  corresponding  to  an  "on-line  delay"  are 
accumulated  prior  to  starting  the  actual  division  algorithm. 

OBJECTIVE  3:   The  basic  computational  step  should  be 
invariant  at  every  step  j  and  the  only  primitive 
arithmetic  operation  should  be  addition.   The  selection 
procedure  generating  one  result  digit  per  step  should 
be  such  that  the  step  execution  time  is  independent  of 
the  operands;  i.e.,  the  selection  should  be  based  on 
a  limited  precision  model  of  the  operands. 

The first objective implies the use of a redundant representation
of the results.   Without redundancy the problem cannot be solved.   The
second objective calls for a more complicated basic computational step
than would otherwise be necessary, to allow for the on-line arrival of
operand digits.   The step invariance requirement of the third objective
makes the control section of the implementation very straightforward, while
the requirement that addition be the only primitive operator simplifies
the processing section of the implementation.   In order for the selection
procedure to be independent of the length of the operands and, thus, step
invariant, a limited propagation mode of addition must be employed.   This,
in turn, provides for a cost effective speed up of the overall algorithms.

Once  algorithms  which  satisfy  these  objectives  have  been  developed, 
an  arithmetic  unit  encompassing  all  of  the  algorithms  must  be  specified. 
This  unit  should,  of  course,  be  real  world  implementable.   To  accomplish  this 
end,  the  unit  itself  must  conform  to  several  objectives  imposed  during  the 
logic  design  phase  of  research. 

OBJECTIVE  4:   The unit should comply with the design
constraints of LSI (Large Scale Integration).

OBJECTIVE  5:   The  unit  as  a  whole  should  have  conventional 
input/output  requirements;  i.e.,  the  operands  which  are 
input  and  the  results  which  are  output  should  be  in  a 
conventional form (e.g., two's complement, sign magnitude,
etc.).

OBJECTIVE  6:   The  unit  as  designed  should  be  modular  and 
expandable,  both  from  the  individual  chip  and  the  overall 
system  viewpoint.   It  should  be  designed  to  handle  floating 
point  numbers. 


OBJECTIVE  7:   The  unit  should  be  fast  as  compared  to  typical 
central  processor  and  memory  speeds. 

In  order  to  comply  with  the  fourth  objective,  the  unit  must  possess 
a  high  circuit  density,  regularity  of  structure,  a  low  pin  count,  and  a 
large  domain  of  applications.   The  fifth  objective  requires  that  the 
redundant  nature  of  the  algorithms  be  hidden  from  the  user.   The  result 
digits  generated  by  the  algorithms  must  be  recoded  into  a  conventional 
format  before  they  are  output.   By  designing  the  unit  to  function  either  in 
a  serial  mode  as  a  stand  alone  unit  or  in  a  pipelined  mode  in  connection 
with  other  identical  units,  the  sixth  objective  would  be  satisfied.   The 
last  objective  is  difficult  to  quantify  given  today's  rapidly  changing 
technology. 

1.2  Related  Work 

Several  of  the  well  known  basic  algorithms  satisfy  the  on-line 
property  with  respect  to  either  the  operands  or  the  results.   Consider,  for 
example,  conventional  division  which  has  the  on-line  property  with  respect 
to  the  quotient  digits.   Similarly,  conventional  multiplication  has  the 
on-line  property  with  respect  to  the  multiplier.   Several  authors  have 
extended  this  on-line  property  for  multiplication  to  the  product  digits 
as well [AVI62, PIS70, GOY76].

It is also possible to define algorithms conforming to a right-to-left
type of processing; i.e., algorithms which operate from the least-to-most
significant end.   Conventional addition and subtraction process operands
and produce result digits in this manner.   Atrubin [ATR65] developed a
right-to-left type algorithm for multiplication.   But, since the division
process must by its very nature operate in a left-to-right fashion for
the  calculation  of  the  quotient,  this  left-to-right,  on-line  processing 
was  imposed  upon  all  of  the  algorithms.   The  most  significant  digit  first 
approach  is  also  consistent  with  other  arithmetic  processes  such  as  operand 
normalization,  mantissa  overflow  detection,  and  result  sign  determination. 
All  of  these  processes  inherently  require  examination  of  the  most  significant 
digits  of  the  result.   Thus,  in  this  thesis  the  on-line  process  is  defined 
to  be  that  process  in  which  all  of  the  operand  digits  as  well  as  the  result 
digits  flow  through  the  arithmetic  unit  in  a  left-to-right,  digit-by-digit 
fashion. 

To fulfill the on-line requirements, a set of left-to-right,
digit-by-digit algorithms had to be derived.   The existence of such
algorithms is contingent upon the use of a redundant representation for the
result.   In the past, redundant number representations have often proved
useful for speeding up arithmetic operations [MET57, ROB58, TOC58, AVI61,
AVI62, PEN62].

In a non-redundant system, even simple operations like addition and
subtraction possess a significant on-line delay† due to the carry
propagation, which may involve the full precision of the result.   By
allowing redundancy in the number representation, it is possible to limit
the carry propagation to one (or two) digital positions [MET57, ROH67,
BOR68, ROB67].   Thus, on-line algorithms for addition (and subtraction)
with on-line delays of at most one or two digits can be easily developed.
Campeau [CAM70] has also developed an on-line algorithm for multiplication
with an on-line delay of one digit.   More recent work in the area of
on-line computation has been done by Ercegovac [ERC75, TRI75].   He
developed an on-line multiplication algorithm which combines the technique
of incremented multiplication, as used in digital differential analyzers
[BRA63, CAM70, BAK75], with the use of a redundant number system.
Ercegovac also proposed an on-line division algorithm in his work.   It,
however, is not suitable for implementation (in this author's viewpoint)
because it requires excessive scaling of the operands initially to ensure
convergence of the algorithm.   The existence of a reasonable on-line
division algorithm, however, has not been at all obvious.   The first
attempt at deriving a reasonable algorithm for division was made by Trivedi
[TRI75] in the Spring of 1975.   In his algorithm the on-line delay for
division depends upon the radix and other properties of the number system
used.   The delay is generally a small, positive constant and alleviates
the problem of excessive operand scaling.   Division methods presented in
this thesis are extensions of Trivedi's preliminary work.   His algorithm,
combined with parts of the previously mentioned work by Ercegovac, led to
the evolution of a set of compatible algorithms for the basic arithmetics.

† Recall that the on-line delay corresponds to the number of digits of the
operands which must be input before the generation of correct result digits
can be initiated.

1.3  Number  Representation 

All numerical values considered in this thesis are assumed to be
represented in finite precision, floating point format with a representational
error of $|\epsilon| < r^{-m}$, where m is the precision of the mantissa.
The effect of the representational errors is minor and necessitates only a
slight extension of the precision of the initial data to obtain the required
precision of the results.   Thus, the representational error is not of
immediate concern.   It is assumed that the precision of all initial data
has been properly adjusted so that, for a given precision of m bits, the
data can be regarded as exact.

Consider an m digit, radix r fractional component of a floating
point number N, where

$$N = \sum_{i=1}^{m} n_i r^{-i}.$$

Using a conventional representation, each digit $n_i$ can assume any value
from the digit set $\{0, 1, \ldots, r-1\}$.   Such representations, allowing
only r values in the digit set, are non-redundant.   There is only one
(unique) representation for each representable number.   By contrast,
number systems that allow
more  than  r  values  in  the  digit  set  are  redundant  and  allow  more  than  one 
representation  for  each  number.   See  Appendix  I  for  a  formal  definition  of 
redundancy.   The  scope  of  this  thesis  covers  those  cases  where  the  operands 
are  provided  in  a  conventional  format  while  the  results  are  generated  in  a 
redundant  format.   A  "mostly"  on-line  recoding  scheme  for  converting 
redundant  results  into  conventional  format,  as  given  in  Appendix  III,  is 
then  used  so  that  the  unit  will  satisfy  OBJECTIVE  5. 
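
The distinction between non-redundant and redundant digit sets can be made concrete with a small enumeration. The following is only an illustrative sketch: the function name `representations`, the choice of radix 4, and the signed digit range {-3, ..., 3} are assumptions made for the example and are not taken from the thesis.

```python
from fractions import Fraction
from itertools import product

def representations(value, r, digits, m):
    """Return all m-digit strings (n_1, ..., n_m) over `digits` whose
    value sum(n_i * r**-i) equals `value` exactly."""
    found = []
    for cand in product(digits, repeat=m):
        total = sum(Fraction(n, r**i) for i, n in enumerate(cand, start=1))
        if total == value:
            found.append(cand)
    return found

# Hypothetical example: radix 4, two fractional digits, value 5/16.
value = Fraction(5, 16)
conventional = representations(value, 4, range(0, 4), 2)   # digit set {0,...,3}
redundant = representations(value, 4, range(-3, 4), 2)     # digit set {-3,...,3}
```

With the conventional digit set the only representation found is (1, 1); with the seven-valued digit set the same value is also represented by (2, -3), since 2/4 - 3/16 = 5/16.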

1.4   The  Generalized  Procedure 

The  algorithms  developed  in  this  thesis  consist  of  the  following 
sequence  of  operations  whose  order  may  vary  slightly  from  one  specific 
function  to  another: 


1)  initialization which consists of waiting for sufficient
digits, corresponding to the on-line delay,† to be
input;

2)  input  of  the  next  operand  digits; 

3)  an  addition  which  corrects  for  any  error  made  in  the 
previous  result  digit  selection  and  accounts  for  the 
new  operand  digits  just  received; 

4)  selection  of  the  next  result  digit  based  upon  a  limited 
precision  model;  and 

5)  a  completion  test  which  loops  back  to  step  2)  upon 
failure. 

From  the  above  sequence,  it  is  obvious  that  the  computational  algorithms 
are  step  invariant  with  the  only  primitive  operation  being  addition  as 
required  by  OBJECTIVE  3.   The  operands  are  input  according  to  OBJECTIVE  2 
and  the  results  are  generated  according  to  OBJECTIVE  1. 

1.5  Dissertation  Overview 

Chapters  2,  3,  and  4  present  compatible  on-line  algorithms  for 
division,  multiplication  and  addition/subtraction,  respectively.   The  on-line 
division  algorithm  is  of  primary  concern,  since  it  is  the  most  difficult  of 
all  the  algorithms  to  specify.   Once  division  is  specified,  compatible 
on-line algorithms for the other functions can then be defined.   Each chapter
contains necessary background material on the function in question.   The
on-line algorithms are then presented along with their convergence conditions
and result digit selection schemes.   Finally, valid ranges for the operands
are derived and a hardware block diagram with description is given.

† The on-line delay (step 1 of the generalized procedure) is shown to be zero
for addition, subtraction, and multiplication and is a small, radix dependent
constant for division.

Chapter  5  addresses  the  question  of  implementation.   Suitability  to 
LSI, floating point considerations, hardware requirements, and a system
level overview are discussed.   A summary of the results, applications for
on-line  arithmetic  units,  and  the  possible  implications  of  such  a  device  are 
discussed  in  the  final  chapter. 

2.   DIVISION

Algorithms  which  satisfy  the  on-line  property  for  addition  and 
subtraction  can  be  easily  specified  [AVI61,  ROH67,  ROB67].  Multiplication 
requires  a  somewhat  more  elaborate  approach  [ERC75,  TRI75]  and  will  be 
discussed  in  detail  in  Chapter  3.   However,  the  existence  of  an  on-line 
division  algorithm  was  not  determined  until  a  first  attempt  at  such  an 
algorithm  was  made  by  Trivedi  in  the  spring  of  1975  [TRI75].   The  methods 
presented  in  this  chapter  are  extensions  of  this  preliminary  work.   The 
division  algorithm  must  be  of  primary  concern,  since  the  algorithmic  and 
logic  design  for  division  are  the  most  difficult  of  all  the  algorithms 
to  specify.   Compatible  algorithms  for  on-line  multiplication,  addition, 
and  subtraction  can  then  be  specified. 

2.1  Background 

In  designing  a  computer  arithmetic  unit,  division  is  the  most 
difficult  of  the  basic  operations  to  implement  efficiently.   Division  is 
inherently  a  trial-and-error  process  requiring  an  initial  guess  of  a  quotient 
digit  followed  by  a  comparison  (in  the  form  of  a  subtraction)  to  determine 
whether  this  guess  was  correct.   If  it  was  not,  the  initial  quotient  digit 
is  modified  and  the  process  is  repeated.   This  class  of  division  based 
upon  subtraction  can  be  defined  by  the  recursive  relationship 

$$P_j \leftarrow rP_{j-1} - q_j D, \qquad j = 1, \ldots, m \tag{2.1}$$

in which

$P_0$ is the dividend,

$P_{j-1}$ is the partial remainder used in the j-th recursion,

$P_m$ is the remainder,

j is the recursion index,

$q_j$ is the j-th quotient digit,

D is the divisor, and

r is the radix.

To form the partial remainder, $P_j$, a multiple of the divisor is
subtracted from the previous shifted partial remainder.  The determination
of which multiple of D to subtract is dependent upon the quotient digit;
but it is precisely this quotient digit that must be computed.   It is not
known a priori.   As it stands, this recursion relationship for division
does not adequately specify how $q_j$ is to be selected.  By adding the range
restriction (which is intuitively applied when doing the hand calculation)

$$|P_j| \le K^{\dagger}\,|D|, \tag{2.2}$$

the division algorithm becomes completely specified.   The important point
here is that division not only requires an addition or subtraction (as in
multiplication), but also the selection of a quotient digit such that the
value of the new partial remainder is within a specified range.

On-line division, as investigated in this thesis, is yet a further
complication of this process in that the full precision of the operands,
$P_0$ and D, is not available for comparison.   At first consideration it
would seem that on-line division is impossible.

† K, which appears in (2.2), will be defined in the next section.   It is
sufficient here to know that 1/2 < K ≤ 1.
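
For orientation, the conventional recursion (2.1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of the thesis: it takes a non-redundant digit set {0, ..., r-1}, assumes 0 <= N < D, keeps the remainder in the range 0 <= P_j < D (the familiar hand-calculation case), and resolves each quotient digit with a full precision comparison. The function name `conventional_divide` is hypothetical.

```python
from fractions import Fraction

def conventional_divide(N, D, r, m):
    """Digit recurrence P_j = r*P_{j-1} - q_j*D with P_0 = N.

    Assumes 0 <= N < D and a non-redundant digit set {0, ..., r-1};
    q_j is chosen by a full precision comparison so that 0 <= P_j < D.
    """
    P = Fraction(N)
    D = Fraction(D)
    digits = []
    for _ in range(m):
        shifted = r * P
        q = shifted // D          # requires all digits of the shifted remainder and D
        P = shifted - q * D       # recursion (2.1)
        digits.append(int(q))
    return digits, P

# Hypothetical example: N = 3/8, D = 7/8, radix 10, six quotient digits.
digits, remainder = conventional_divide(Fraction(3, 8), Fraction(7, 8), 10, 6)
```

Here the quotient digits are 4, 2, 8, 5, 7, 1 (3/7 = 0.428571...) and the final remainder is 3/8. Every digit choice inspects the complete operands, which is exactly what an on-line unit cannot do.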

By allowing the quotient digits to take on a redundant† representation,
many of the above problems which are seemingly inherent to division can be
resolved.   As will be discussed in this chapter, redundancy in the
representation of the quotient permits inspection of fewer digits of the
operands in the selection of the quotient digits.   This seems to be
intuitively correct, since without redundancy the quotient has one unique
representation and thus each digit of that quotient must be selected
precisely.  It should be clear that without redundancy it is not possible to
avoid a set of full precision comparisons.   But, with redundancy the
selection of the quotient digit need not be precise.   A selection based upon
just the first few most significant digits of the divisor and partial
remainder is good enough; i.e., the selection is based upon a limited
precision version of the operands.   Thus, by using redundancy the
trial-and-error nature of division can be avoided.   The resultant non-unique
representation of the quotient does, however, complicate the division in that
the redundant form must eventually be converted to a conventional
representation.   See Appendix III for a description of the result digit
recoding algorithm.   Note that, in most cases, this recoding algorithm is
also an on-line, most-to-least significant process.

† See Appendix I for a discussion of redundant number representations.

In the case of on-line division it should be immediately evident
that in the absence of redundancy the problem cannot be solved.  By
definition, during on-line processing the full precision of the operands is
not available for selection.   Therefore, the quotient digit selection must
be based upon a limited precision estimate of the divisor and partial
remainder.   Error is also introduced into the on-line division process when
calculating the new partial remainder.   During the j-th recursion, only
those operand digits which have been received prior to the j-th iteration
have been included in the computation of the old partial remainder.   Part of
the j-th calculation must then be a correction factor which takes into
account the effects of the new dividend and divisor digits on the value of
the new partial remainder.   Thus, the quotient digit selection is based on a
possibly erroneous partial remainder, though this error is relatively small.
Can the margin of allowable error which is permitted by the use of redundant
quotient digits also be made to cover this extra error?   If so, how much
error of this type can be tolerated?   Is there some minimum allowable
operand precision required in order for on-line division to proceed?   This
chapter will resolve these questions by specifying an on-line division
algorithm and the conditions on it.

2.2  The  On-line  Algorithm 

For floating point operation each number X is of the form

$$X = f \cdot r^{e}$$

where, usually, f is a fraction in the range

$$1/r \le |f| < 1$$

and  e  is  an  exponent.   The  arithmetic  for  the  fractional  parts  is  handled 
separately  from  the  arithmetic  for  the  exponents.   Thus,  two  arithmetic 
units  are  required,  one  for  fractions  and  one  for  exponents.   Design  of  the 
exponent  handling  unit  is  straightforward,  requiring  only  addition  and 
subtraction.   However, the design of the unit to handle the fractional
arithmetic  is  nontrivial  and  it  is  the  algorithms  for  this  unit  which  will 
be  discussed  in  depth  in  this  thesis. 

Let the radix r representation of the fractional part of the
positive dividend, divisor, and quotient be denoted by N, D, and R†
respectively, such that

$$N = \sum_{i=1}^{m} n_i r^{-i},$$

$$D = \sum_{i=1}^{m} d_i r^{-i},$$

$$R = \sum_{i=1}^{m} q_i r^{-i},$$

and

$$R = N/D$$

to m digit precision.

Recall that in an on-line environment the digits of the dividend
and divisor are not known in advance, but are available on-line,
digit-by-digit, most significant digit first.   These operand digits, $n_i$
and $d_i$, are typically members of a conventional, nonredundant digit set,
$\mathcal{D}$, such that

$$n_i, d_i \in \{0, 1, 2, \ldots, r-1\}.$$

It is assumed that the dividend and divisor are in normalized form upon
input to the unit; i.e.,

$$1/r \le D, N < 1.$$

The methods presented here are extendible to the case when the operands are
also in redundant form [ATK70].

† The result (quotient) is denoted by R so as to be compatible with the
notation used in the other algorithms as discussed in the next two
chapters.

Assume that the first quotient digit, $q_1$, can be properly selected
after δ leading digits (the on-line delay or 'index difference') of the
dividend and divisor are known.   Thereafter, one new digit of the quotient
can be determined upon the receipt of one new digit each of the dividend
and divisor.   Let the quotient digits be members of a symmetric redundant
digit set, $\mathcal{D}_p$, such that

$$q_i \in \{-p, -(p-1), \ldots, -1, 0, 1, \ldots, p-1, p\}$$

where

$$r/2 \le p \le r-1.$$

The degree of redundancy will be denoted by K, referred to as the redundancy
coefficient, where

$$K = \frac{p}{r-1}.$$

Thus, when using a maximally redundant digit set, K = 1.   When K = 1/2
(i.e., p = (r-1)/2) there is no redundancy in the digit set.

The partial remainder is computed via a limited carry-borrow
propagation adder [ROH67, ROB67], resulting in a redundantly represented
partial remainder.   A limited carry-borrow propagation adder is necessary
to make the time required to perform the recursive step independent of the
precision of the operands; i.e., a carry free, totally parallel addition is
possible.   Thus, the digits of the partial remainder are also members of a
redundant digit set, $\mathcal{D}_{p'}$, which may or may not be the same set
as $\mathcal{D}_p$, the quotient digit set.   The redundancy coefficient of
the adder is, then, K'.

Given these definitions, the algorithm DIVIDE† [TRI75], which is
shown below, can be specified.   In this algorithm, the dividend
and divisor are assumed to be padded with zero digits on the right (least
significant end).   Note that the basic recursion, (2.3), is more complex than
that of the standard division recursion, (2.1), due to the corrective action
necessitated by the operand digits arriving on-line during each iteration.

† The algorithm is so structured as to be compatible with the multiplication
algorithm specified in Chapter 3.

The convergence of the algorithm DIVIDE can be established as
follows.   Using the basic recursion (2.3) in algorithm DIVIDE, the following
expression for the on-line version of the partial remainder, $\hat{P}_j$, can
be derived by induction on j.

$$\hat{P}_j = r^{j}\Big[\sum_{i=1}^{j+\delta} n_i r^{-i} - \Big(\sum_{i=1}^{j} q_i r^{-i}\Big)\Big(\sum_{i=1}^{j+\delta} d_i r^{-i}\Big)\Big] \tag{2.4}$$

which implies that, as j → m,

$$\hat{P}_m = r^{m}[N - R \cdot D]$$

so that

$$R = N/D - \hat{P}_m r^{-m}/D.$$
Algorithm  DIVIDE:

Step  1  [Initialization]:

$\hat{P}_0 \leftarrow \sum_{i=1}^{\delta} n_i r^{-i}$;

$D_0 \leftarrow \sum_{i=1}^{\delta} d_i r^{-i}$;

$R_0 \leftarrow 0$;

$j \leftarrow 0$;  GO  TO  Step  4;

Step  2  [Input  Digit]:

$D_j \leftarrow D_{j-1} + d_{j+\delta} r^{-j-\delta}$;

Step  3  [Basic  Recursion]:

$$\hat{P}_j \leftarrow r\hat{P}_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}; \tag{2.3}$$

Step  4  [Selection]:

$q_{j+1} \leftarrow \mathrm{SELECT}(r\hat{P}_j, D_j)$;

$R_{j+1} \leftarrow R_j + q_{j+1} r^{-j-1}$;

Step  5  [Test]:

IF  (j  <  m)

THEN  $j \leftarrow j + 1$;

GO  TO  Step  2;
ELSE  END  DIVIDE;
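
The steps of DIVIDE can be traced with exact rational arithmetic. The sketch below is hedged in two ways. First, the `select` function is a stand-in chosen for simplicity (exact rounding of the ratio of its two arguments, clamped to the digit set); it is not the limited precision procedure derived in Section 2.3. Second, the sketch stops once q_1, ..., q_m and the m-th partial remainder are available. The names `online_divide`, `select`, and the example operands (radix 4, p = 3, δ = 3, a divisor of at least 1/2) are assumptions made for illustration.

```python
from fractions import Fraction

def select(rP, D, p):
    """Hypothetical SELECT: round rP/D to the nearest integer using exact
    arithmetic, then clamp to the digit set {-p, ..., p}."""
    return max(-p, min(p, round(rP / D)))

def online_divide(n_digits, d_digits, r, p, delta, m):
    """Sketch of DIVIDE, stopping once q_1..q_m and P_m are available.

    n_digits, d_digits: radix-r digits of N and D, most significant first;
    missing digits are treated as zero padding.
    Returns (quotient digits, R_m, P_m).
    """
    def digit(ds, i):                      # 1-indexed access with zero padding
        return ds[i - 1] if i <= len(ds) else 0

    # Step 1: accumulate the first delta digits of each operand.
    P = sum(Fraction(digit(n_digits, i), r**i) for i in range(1, delta + 1))
    D = sum(Fraction(digit(d_digits, i), r**i) for i in range(1, delta + 1))
    R = Fraction(0)
    q = []
    for j in range(1, m + 1):
        # Step 4 (at index j-1): select q_j and extend the quotient to R_j.
        qj = select(r * P, D, p)
        R_prev, R = R, R + Fraction(qj, r**j)
        q.append(qj)
        # Step 2: bring in the next operand digits and update D_j.
        n_new, d_new = digit(n_digits, j + delta), digit(d_digits, j + delta)
        D += Fraction(d_new, r**(j + delta))
        # Step 3: recursion (2.3), using R_{j-1} and the updated D_j.
        P = r * P - qj * D + (n_new - R_prev * d_new) * Fraction(1, r**delta)
    return q, R, P

# Hypothetical operands in radix 4, most significant digit first.
n_digits = [1, 2, 3, 0, 1, 2, 3, 0]
d_digits = [2, 3, 1, 0, 2, 1, 3, 2]
q, R, P = online_divide(n_digits, d_digits, r=4, p=3, delta=3, m=8)
```

Because the arithmetic is exact, the returned remainder satisfies the closed form (2.4) at j = m regardless of how the digits were selected, so the quotient error is the remainder times r^{-m}/D. For these particular operands the stand-in rule keeps that error below r^{-m}; no claim is made that it does so for arbitrary operands or smaller δ.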

Therefore, if a quotient digit selection procedure, SELECT in
Step 4 of algorithm DIVIDE, is devised such that for j = m

$$|\hat{P}_j| \le K \cdot D \tag{2.5}$$

where

$$1/2 < K \le 1,$$

then R = N/D can be computed to m digit precision.   Assuming that a
selection procedure can be specified which generates the quotient digit
$q_{j+1}$ while guaranteeing that $|\hat{P}_{j+1}| \le KD$ given that
$|\hat{P}_j| \le KD$, then, by induction, the range restriction (2.5) will
hold for all values of j.   (For j = 0, (2.5) can be satisfied by
appropriately preshifting the dividend as explained in Section 2.4.)   Such a
selection procedure is derived in the next section.   First a bound will be
established on $P_j - \hat{P}_j$ and then a selection procedure will be
developed which guarantees that $|P_j| \le KD$.   This in turn will give a
bound on $\hat{P}_j$.

J 

2.3  Quotient  Digit  Selection 

The  division  procedure  may  be  defined  graphically  with  a  construction 
suggested  by  C.  V.  Freiman  [FRE61].   The  basis  for  its  construction  is  the 
basic  recursive  relationship,  (2.3),  together  with  the  range  restriction, 
(2.5),  which  has  been  adjusted  to  include  the  error  introduced  by  on-line 
processing.   The  figure  is  essentially  a  plot  of  partial  remainder  versus 
divisor  values  and  is  thus  designated  a  P-D  plot.   By  analyzing  such  a 
plot, a quotient digit selection procedure can be fully specified for a
given r, p, and δ.

2.3.1  Range  Restriction  Analysis 

From the recursion (2.1), the following equation can be derived by
induction.

$$P_j = r^{j}\Big[\sum_{i=1}^{m} n_i r^{-i} - \Big(\sum_{i=1}^{j} q_i r^{-i}\Big)\Big(\sum_{i=1}^{m} d_i r^{-i}\Big)\Big]. \tag{2.6}$$
Subtracting equation (2.4) from equation (2.6) gives

$$P_j - \hat{P}_j = r^{j}\Big[\sum_{i=j+\delta+1}^{m} n_i r^{-i} - \Big(\sum_{i=1}^{j} q_i r^{-i}\Big)\Big(\sum_{i=j+\delta+1}^{m} d_i r^{-i}\Big)\Big]. \tag{2.7}$$

Recall that $P_j$ is the normal full precision representation of the partial
remainder and that $\hat{P}_j$ is the on-line version of the partial
remainder.   Thus, equation (2.7) is a measure of the error introduced at a
particular step, j, by using the on-line algorithm.   Now, determine the
bounds on this error.

UPPER  BOUND: 

m  j  _ .  m 

P     -  P      <  rJ[        i        (r-l)r-1   +   (    z    p  r    X)  (        E        (r-l)r-1)  ] 
J  J  i=j+<5+l  i=l  i=j+6+l 

_<  rJ[(r-l)(r  J  -   r  )  (^-)   + 

p(r       -   r  )  (— j-)  (r-1)  (r  J  -  r  )  (^- )] 

=  r"6(l+K)    -  r"m+j(l+K)    -  r"j"6K  +  r_mK 

Since  m  is  assumed  to  be  large  with  respect  to  5,  the  upper  bound  is 
certainly  less  than 

P  -  P <  (1+K)r~6   . 

LOWER  BOUND: 

j     _.     m 
P,  "  P,  >  -rJ[(  Z   Pr  X)(   l        (r-l)r  X)  ] 
J    J        i=l        i=j+6+l 

j  r  /  ~1    ~j-l\/  r  w   i\/  —  i— 6  —  1    -m-lx  /  r  N  , 
^-rJ[p(r   -rJ   )  (— j-)  (r-1)  (r  J     -r    )  (— j- )  ] 

=  -K(r   -r   J-rJ   +r) 
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And  since  m  is  assumed  to  be  large  with  respect  to  6 ,  the  lower  bound  is 

certainly  greater  than 

A  — (S 

P.  -  P.  >  -  Kr    . 
J    J  ~ 


Combining the above results

$$-K\,r^{-\delta} \le P_j - \hat{P}_j \le (1+K)\,r^{-\delta}. \tag{2.8}$$

Recall from equation (2.2), the range restriction on $P_j$, that

$$-KD \le P_j \le KD. \tag{2.9}$$

From equation (2.8) and (2.9), the range restriction on $\hat{P}_j$ becomes

$$-KD + K\,r^{-\delta} \le \hat{P}_j \le KD - (1+K)\,r^{-\delta}.$$

Since $K$ is positive, equation (2.5) is satisfied by the above equation for
$j = 1,\ldots,m$ and by using this range restriction on $\hat{P}_j$ to define the
selection procedure, $R = N/D$ can be computed to $m$ digit precision.
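
The bounds in (2.8) can be checked numerically. The following is a minimal sketch, not part of the original derivation: it evaluates the error expression (2.7) exactly for randomly chosen digit strings and confirms that it stays inside the interval of (2.8). The function names and the random sampling scheme are assumptions made for illustration; dividend and divisor digits are taken from {0, ..., r-1} and quotient digits from {-ρ, ..., ρ}, as in the bounding argument above.

```python
from fractions import Fraction
import random


def online_error(r, j, delta, n, d, q):
    """Evaluate P_j - P_hat_j from equation (2.7) with exact rationals.

    n, d, q are digit lists; element i-1 holds the digit with weight r**-i.
    """
    m = len(n)
    tail = range(j + delta + 1, m + 1)          # digits not yet seen on-line
    tail_n = sum(Fraction(n[i - 1], r**i) for i in tail)
    tail_d = sum(Fraction(d[i - 1], r**i) for i in tail)
    head_q = sum(Fraction(q[i - 1], r**i) for i in range(1, j + 1))
    return r**j * (tail_n - head_q * tail_d)


def check_error_bounds(r, rho, delta, m, trials=2000, seed=0):
    """Return True if every sampled error satisfies the bounds of (2.8)."""
    rng = random.Random(seed)
    K = Fraction(rho, r - 1)                    # redundancy coefficient
    lower = -K * Fraction(1, r**delta)
    upper = (1 + K) * Fraction(1, r**delta)
    for _ in range(trials):
        n = [rng.randint(0, r - 1) for _ in range(m)]
        d = [rng.randint(0, r - 1) for _ in range(m)]
        q = [rng.randint(-rho, rho) for _ in range(m)]
        for j in range(1, m + 1):
            if not lower <= online_error(r, j, delta, n, d, q) <= upper:
                return False
    return True
```

Calling `check_error_bounds(2, 1, 3, 10)` or `check_error_bounds(4, 2, 4, 8)` exercises the binary and the minimally redundant radix 4 cases respectively.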


2.3.2  The  Selection  Equations 

By applying the range restriction (2.9) on $P_{j+1}$ and using the
(incremented) recursion relationship (2.1), the selection region of $rP_j$ for
each possible value of $q_{j+1}$ can be determined.   Let $q_{j+1} = i$ such that
$i \in \{-\rho, \ldots, \rho\}$; then the i-selection region guaranteeing the range
restriction (2.9) is given by

$$(-K+i)D \le rP_j \le (K+i)D. \tag{2.10}$$

The corresponding i-selection region for $r\hat{P}_j$ is obtained using equation (2.8)
and (2.10).   Thus, a partial definition of the SELECT function given in
algorithm DIVIDE as

$$q_{j+1} \leftarrow \mathrm{SELECT}(r\hat{P}_j, D_j)$$

becomes

$$(-K+i)D + K\,r^{-\delta+1} \le r\hat{P}_j \le (K+i)D - (1+K)\,r^{-\delta+1}. \tag{2.11}$$

This condition can be graphically described by means of a P-D plot,
as in Figure 2.1.   (The difference between this P-D plot and the conventional
P-D plot is that the ordinate is $r\hat{P}_j$ instead of $rP_j$.)   It consists of a
family of curves which are linear functions of $D$ with $q_{j+1}$ as a parameter
ranging from $-\rho$ to $+\rho$ in steps of 1.   The area between the maximum $r\hat{P}_j$ and
the minimum $r\hat{P}_j$ will be denoted the "$q_{j+1} = i$ region."

So, for a given base ($r$), redundancy coefficient ($K$), and index
difference ($\delta$) the division procedure can be specified via a corresponding
P-D plot.   A given value of $D_j$ and $r\hat{P}_j$ will correspond to a point in an
i-selection region.   The quotient digit $q_{j+1}$ is, therefore, $i$ and is used in
forming the next partial remainder.

Figure 2.1 is an example of a full P-D plot with $r = 4$, $K = 2/3$, and
$\delta = 4$.   The equations for the selection lines are given in Table 2.1.   Note
that, as a consequence of redundancy in the representation of the quotient,
there is an overlap between each adjacent quotient digit selection region.
Some values of $r\hat{P}_j$ and $D_j$ will specify a point for which either $q_{j+1} = i$ or
$q_{j+1} = i-1$ is a valid choice.   It is this overlap which permits the quotient
digit selection to be made on the basis of estimates of the full precision
divisor and shifted partial remainder and thus permits on-line division.
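
The overlap can be made concrete with a short sketch that lists every quotient digit whose selection region, as given by the bounds of equation (2.11), contains a given point of the P-D plot. This is only an illustration under stated assumptions: the function name `valid_digits` and its argument names are hypothetical, and the inputs are treated as exact rationals rather than as the limited precision estimates discussed later.

```python
from fractions import Fraction


def valid_digits(r, rho, delta, rp_hat, d):
    """Return all digits i in {-rho..rho} whose region (2.11) contains the point.

    rp_hat is the shifted on-line partial remainder, d the divisor value.
    """
    K = Fraction(rho, r - 1)              # redundancy coefficient
    shift = Fraction(r, r**delta)         # r**(-delta + 1)
    digits = []
    for i in range(-rho, rho + 1):
        lower = (-K + i) * d + K * shift
        upper = (K + i) * d - (1 + K) * shift
        if lower <= rp_hat <= upper:
            digits.append(i)
    return digits
```

For the parameters of Figure 2.1 (r = 4, ρ = 2 so that K = 2/3, δ = 4) and D = 3/4, the point with ordinate 2/5 lies in the overlap of the 0 and 1 regions, while the point with ordinate 9/10 lies only in the 1 region.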

Figure 2.1  P-D Plot with r=4, K=2/3, and $\delta$=4

Table 2.1   Equations Defining the Selection Regions of Figure 2.1

Selection Lines      Selection Equations

UPPER(2)             $4\hat{P}_j \le (\tfrac{2}{3} + 2)D - \tfrac{5}{3}\,4^{-3}$
LOWER(2)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 2)D + \tfrac{2}{3}\,4^{-3}$

UPPER(1)             $4\hat{P}_j \le (\tfrac{2}{3} + 1)D - \tfrac{5}{3}\,4^{-3}$
LOWER(1)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 1)D + \tfrac{2}{3}\,4^{-3}$

UPPER(0)             $4\hat{P}_j \le (\tfrac{2}{3})D - \tfrac{5}{3}\,4^{-3}$
LOWER(0)             $4\hat{P}_j \ge (-\tfrac{2}{3})D + \tfrac{2}{3}\,4^{-3}$

UPPER($\bar{1}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 1)D - \tfrac{5}{3}\,4^{-3}$
LOWER($\bar{1}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 1)D + \tfrac{2}{3}\,4^{-3}$

UPPER($\bar{2}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 2)D - \tfrac{5}{3}\,4^{-3}$
LOWER($\bar{2}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 2)D + \tfrac{2}{3}\,4^{-3}$

By tightening the lower bound of the selection equation (2.11) to
give the selection equation

$$(-K+i)D + (1+K)\,r^{-\delta+1} \le r\hat{P}_j \le (K+i)D - (1+K)\,r^{-\delta+1} \tag{2.12}$$

the full P-D plot becomes symmetric about both axes as shown in Figure 2.2
with $r = 4$, $K = 2/3$, and $\delta = 4$.  The corresponding selection equations are
given in Table 2.2.

Although  this  more  restrictive,  but  still  valid  equation  reduces  the 
overlap  regions  slightly,  this  reduction  is  more  than  compensated  for  by  the 
fact  that  all  of  the  quadrants  now  have  identical  (except  for  sign)  overlap 
regions.   Thus,  only  quadrant  I  need  actually  be  implemented.   This  small 
change  in  the  lower  bound  does  not  significantly  increase  the  complexity  of 
the step definition for the quotient digit selection (see Section 2.3.4).

In  the  rest  of  this  chapter  we  will  restrict  our  attention  to  the 
first  quadrant  of  the  P-D  plot  defined  according  to  the  selection  equation 
(2.12).   (Representative  plots  are  collected  into  Appendix  II.) 

2.3.3  Determining  the  Minimum  Index  Difference 

Recall that an initial assumption (that the first quotient digit, $q_1$,
can be properly selected after $\delta$ leading digits of the dividend and divisor
are known) was made.   Now the question arises as to what is the minimum
possible value for $\delta$, the index difference for division.   The minimum value
for $\delta$ is desired because this determines the initial delay time before the
division algorithm can start producing quotient digits.   The $\delta$ most significant
digits of the dividend and divisor must be available initially.   Thus, if
one memory word† holds 4 operand digits, $2 \cdot \lceil \delta/4 \rceil$ memory accesses must be
made prior to the generation of the first quotient digit.

† Throughout this thesis the term "word" is used to refer to the width of
the memory (e.g., 4, 8, or 16 bits for microprocessors).   The term "digit"
is one radix $r$ digit (e.g., one digit is 2 bits for radix 4).   Thus, one
word may consist of several digits, while the full precision operands may
conceivably consist of several memory words (variable precision) up to
some hardware limited maximum.

Figure 2.2   Modified Symmetric P-D Plot with r=4, K=2/3, and $\delta$=4

Table 2.2   Equations Defining the Selection Regions of Figure 2.2

Selection Lines      Selection Equations

UPPER(2)             $4\hat{P}_j \le (\tfrac{2}{3} + 2)D - \tfrac{5}{3}\,4^{-3}$
LOWER(2)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 2)D + \tfrac{5}{3}\,4^{-3}$

UPPER(1)             $4\hat{P}_j \le (\tfrac{2}{3} + 1)D - \tfrac{5}{3}\,4^{-3}$
LOWER(1)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 1)D + \tfrac{5}{3}\,4^{-3}$

UPPER(0)             $4\hat{P}_j \le (\tfrac{2}{3})D - \tfrac{5}{3}\,4^{-3}$
LOWER(0)             $4\hat{P}_j \ge (-\tfrac{2}{3})D + \tfrac{5}{3}\,4^{-3}$

UPPER($\bar{1}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 1)D - \tfrac{5}{3}\,4^{-3}$
LOWER($\bar{1}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 1)D + \tfrac{5}{3}\,4^{-3}$

UPPER($\bar{2}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 2)D - \tfrac{5}{3}\,4^{-3}$
LOWER($\bar{2}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 2)D + \tfrac{5}{3}\,4^{-3}$

The minimum allowable value for $\delta$ can be determined by requiring
that the lower bound for a $q_{j+1} = i$ selection region and the upper bound for
the corresponding $q_{j+1} = i-1$ selection region intersect at a value of
$D \le \tfrac{1}{2}$; there must be a nonzero selection overlap for all values of $\tfrac{1}{2} \le D < 1$.
Otherwise, there are valid regions on the P-D plot where the quotient digit,
$q_{j+1}$, is undefined; that is, the division algorithm could not be completely
specified.   For example, consider the case of $r = 2$, $K = 1$, and $\delta = 3$ as
shown in Figure 2.3.   The shaded area is a valid region of the plot where
the value of the quotient digit is undefined.   The worst case occurs when
$D = \tfrac{1}{2}$ and $i$ is either $\rho$ or $\rho-1$, the selection overlap region between the
lower limit of $\rho$ and the upper limit of $\rho-1$.   If this overlap region is
non-null, then all of the selection overlap regions are guaranteed to be
non-null.   The condition

$$(-K+\rho)\tfrac{1}{2} + (1+K)\,r^{-\delta+1} \le (K+\rho-1)\tfrac{1}{2} - (1+K)\,r^{-\delta+1}$$

must hold.   Since $\delta$ is required to be an integer, then

$$\delta \ge \Big\lceil \log_r \frac{4(1+K)}{2K-1} \Big\rceil + 1. \tag{2.13}$$

So, for the case of $r = 2$ and $K = 1$ it is required that $\delta \ge 4$.   Figure 2.4
is the P-D plot for the specific case of $\delta = 4$.
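
As a hedged illustration of the overlap condition above, the following sketch searches for the smallest integer index difference for which the lower limit of the ρ region does not exceed the upper limit of the ρ-1 region at D = 1/2, using the symmetric selection equation (2.12). The function name and the search cap `max_delta` are assumptions introduced for this example; exact rational arithmetic is used to avoid rounding in the logarithm.

```python
from fractions import Fraction


def min_index_difference(r, rho, max_delta=64):
    """Smallest delta with LOWER(rho) <= UPPER(rho - 1) at D = 1/2, per (2.12)."""
    K = Fraction(rho, r - 1)                      # redundancy coefficient
    if 2 * K <= 1:
        raise ValueError("no overlap is possible unless K > 1/2")
    half = Fraction(1, 2)
    for delta in range(1, max_delta + 1):
        c = (1 + K) * Fraction(r, r**delta)       # (1+K) * r**(-delta+1)
        lower_rho = (-K + rho) * half + c
        upper_rho_minus_1 = (K + rho - 1) * half - c
        if lower_rho <= upper_rho_minus_1:
            return delta
    raise ValueError("no delta found up to max_delta")
```

For r = 2 with ρ = 1 (K = 1) the search returns 4, in agreement with the text; for r = 4 with ρ = 2 (K = 2/3) it also returns 4, the value used in Figures 2.1 and 2.2.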

Figure 2.3  P-D Plot with r=2, K=1, and $\delta$=3.   Selection equation: $(-1+i)D + 2\cdot 2^{-2} \le 2\hat{P}_j \le (1+i)D - 2\cdot 2^{-2}$

Figure 2.4  P-D Plot with r=2, K=1, and $\delta$=4.   Selection equation: $(-1+i)D + 2\cdot 2^{-3} \le 2\hat{P}_j \le (1+i)D - 2\cdot 2^{-3}$

Looking ahead to implementation, given the pin limitations of LSI
(Large Scale Integration), reasonable values for the number of bits of each
operand input to one arithmetic unit (AU) at the beginning of each cycle are
4, 8, or 16.   It would be preferable to have $\delta$ small enough so that the
division algorithm could proceed after only one memory word for each of the
two operands had been input (i.e., a delay of only two memory accesses).   On
the other hand, increasing $\delta$ increases the size of the overlap region and,
thus, simplifies the quotient digit selection.   A compromise must be made.

In the binary case, $r = 2$, a digit is one bit, so a convenient choice
for $\delta$ would be 4, 8, or 16 respectively, depending on the number of bits
input to the AU during each cycle.   For base 4, $r = 4$, a digit consists of
2 bits, so $\delta$ should be 2, 4, or 8 respectively.   Similarly for base 16.
Some of the above choices must be eliminated because they do not meet the
restriction on minimum $\delta$.   See Appendix II for some representative P-D
plots with $\delta$ values as discussed above.

2.3.4  The  Model  Division 
Preliminary  Remarks 

As  stated  in  Section  2.1,  the  advantage  of  using  redundant  quotient 
digits  is  that  it  eliminates  the  trial  and  error  nature  of  division.   Using 
redundant  quotient  digits  permits  the  selection  to  be  based  upon  a  limited 
precision  model  of  the  operands,  thus  circumventing  the  need  for  a  full 
precision  comparison.   Sufficient  background  has  now  been  given  to  permit 
a  complete  definition  of  the  SELECT  function  of  algorithm  DIVIDE  resulting 
in  a  limited  precision  division  model. 

The limited precision model is a device which, when given estimates
of the divisor and shifted partial remainder of sufficient precision, will
output a quotient digit such that restriction (2.12) is satisfied.   Thus the
model must be able to select, given only an estimate of the operands, the
correct quotient digit values.   If the point corresponding to the values of
$r\hat{P}_j$ and $D_j$ falls in an overlap region of the P-D plot, the model must make
a choice between two adjacent quotient digit values.   It must take into
account the error incurred by the limited precision inputs.   While making the
selection based upon these inputs, it must guarantee that the quotient digit
selected is also valid for the full precision values.

The selection procedure can be visualized as a series of steps
spanning the overlap regions.   By comparing the values of the estimates to
these steps, the appropriate quotient digit can be selected.   If the values
of $r\hat{P}_j$ and $D_j$ correspond to a point lying on or above the step, the larger
quotient digit is selected, while if the point lies below the step, the
smaller quotient digit is selected.   Sometimes the step is one simple
comparison constant for $r\hat{P}_j$, or "tread," which spans the entire overlap region
from $D_j = \tfrac{1}{2}$ to $D_j = 1$.   In this case, the quotient digit can be selected
based merely upon the value of the shifted partial remainder and is
independent of the value of the divisor!   But, more often, due to the
steepness and narrowness of the overlap region, the step consists of a
connected series of "treads" and "risers" which span the region.   Here, the
risers define the divisor limits for which the corresponding tread
comparison constant for $r\hat{P}_j$ is valid.   See Appendix II for some typical
steps.

The steps should be chosen such that the simplest and fewest
comparisons need be made.   These steps, in turn, are dependent upon the
precision of the model.   The comparison constant values for the steps must
be numbers which are representable given the amount of precision required
by the model.   Thus, before the steps can be defined, sufficient precision
must be determined.

Sufficient Precision

Assume that sufficient precision corresponds to the use of $\alpha$ most
significant digits of $r\hat{P}_j$ and $\beta$ most significant digits of $D_j$.   Thus, $\alpha$
digits of $r\hat{P}_j$ and $\beta$ digits of $D_j$ are used as inputs to the limited precision
division model.   Denote these truncated estimates as $r\tilde{P}_j$ and $\tilde{D}_j$ correspondingly.
Recall that only $\delta+j$ digits of each operand are known at Step $j$.   Then,
obviously,

$$\beta \le \delta.$$

Denote the maximum error introduced by this truncation into the
representation of the operands as $\Delta rp$ and $\Delta d$.   Then $\Delta rp$ and $\Delta d$ are defined by

$$|r\hat{P}_j - r\tilde{P}_j| \le \Delta rp$$

and

$$|D_j - \tilde{D}_j| \le \Delta d.$$

Since the $\beta$ most significant digits of $D_j$ are invariant for iterations
$\beta \le j \le m$, the value of $\tilde{D}_j$ is a constant or just

$$\tilde{D} \leftarrow \tilde{D}_j$$

and, for base $r$

$$\Delta d = r^{-\beta}. \tag{2.14}$$

And, since the partial remainder is computed via a limited carry-borrow
propagation adder and is, therefore, in redundant notation,

$$\Delta rp = K' r^{-\alpha+1} \tag{2.15}$$

where $K'$ is the redundancy coefficient of the adder.

The conditions for determining the smallest possible $\alpha$ and $\beta$ can be
found by investigating the worst case (steepest and narrowest) overlap
region of the P-D plot, that is when

$$D = \tfrac{1}{2}$$

and

$$i = \rho.$$

See Figure 2.5.   Sufficient precision of $r\hat{P}_j$ and $D_j$, as represented by their
truncated estimates $r\tilde{P}_j$ and $\tilde{D}$, is insured if a selection step can be defined
in this worst case region.   Then, steps for the rest of the overlap regions
can be found and the model division completely specified.

Figure 2.5 shows the upper selection limit for $\rho-1$, UPPER($\rho-1$),

$$r\hat{P}_j \le (K+\rho-1)D - (1+K)\,r^{-\delta+1} \tag{2.16}$$

and the lower selection limit for $\rho$, LOWER($\rho$),

$$r\hat{P}_j \ge (-K+\rho)D + (1+K)\,r^{-\delta+1} \tag{2.17}$$

near $D = \tfrac{1}{2}$.   An estimate, $r\tilde{P}_j$, falling in this region and resulting in the
selection of the maximum quotient digit, $q_{j+1} = \rho$, must meet the following
constraints

$$r\tilde{P}_j - \Delta rp \ge (-K+\rho)(\tfrac{1}{2} + \Delta d) + (1+K)\,r^{-\delta+1} \tag{2.18}$$

and

$$r\tilde{P}_j \le (K+\rho-1)\tfrac{1}{2} - (1+K)\,r^{-\delta+1} \tag{2.19}$$

for the selection to also be valid for the full precision operand, $r\hat{P}_j$.

Figure 2.5  P-D Plot Showing the Worst Case Overlap Region
(plot labels: UPPER($\rho-1$), LOWER($\rho$), LL($\rho$), Equation (2.18), Equation (2.19); the $D$ axis is marked at its minimum and maximum values with spacing $\Delta d$)
Thus, the dotted line, LL($\rho$), in Figure 2.5 defines an absolute
lower limit for a tread value, $r\tilde{P}_j$, which would result in a quotient digit
selection of $q_{j+1} = \rho$.   A similar lower limit exists for each overlap region
and will be denoted as LL($i$).   The upper limit on the possible tread values
is obviously the upper limit of the corresponding overlap region.   Thus,
the treads (and, hence, the risers) of each staircase must be fully
contained between the lines LL($i$) and UPPER($i-1$) in each appropriate overlap
region.   These treads and risers should assume the simplest possible binary
values which conform to the limits and the precision requirements.

The minimum values of $\alpha$ and $\beta$ can now be empirically defined.
Subtracting equation (2.18) from (2.19) gives

$$\Delta rp \le K - \tfrac{1}{2} + (K-\rho)\Delta d - 2(1+K)\,r^{-\delta+1}.$$

Substituting the values of $\Delta rp$ and $\Delta d$, (2.14) and (2.15), into the above
equation gives

$$K' r^{-\alpha+1} + K(r-2)\,r^{-\beta} \le K - \tfrac{1}{2} - 2(1+K)\,r^{-\delta+1}. \tag{2.20}$$

For a given base ($r$), redundancy coefficient of the quotient ($K$),
index difference ($\delta$), and redundancy coefficient of the adder ($K'$),
interdependent $\alpha$ and $\beta$ values can be defined using equation (2.20).   Recall
that $\alpha$ represents the number of digits in $r\hat{P}_j$ which are redundant.   Thus, an
attempt should be made to minimize $\alpha$ even at the cost of increasing $\beta$ to
the maximum allowed value of $\delta$.   The flowchart of the program used to
determine near minimal values for $\alpha$ and $\beta$ is given in Figure 2.6.

Figure 2.6  Flowchart for Determining $\alpha$ and $\beta$
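
One possible search consistent with inequality (2.20) is sketched below. It is not a transcription of the flowchart in Figure 2.6; it is an assumed strategy that follows the stated priority: first find the smallest α with β at its maximum allowed value δ, then shrink β as far as the inequality permits. The function names, the cap `alpha_max`, and the treatment of K' as an arbitrary rational input are all assumptions of this example.

```python
from fractions import Fraction


def satisfies_precision(r, rho, delta, k_adder, alpha, beta):
    """Check inequality (2.20) exactly for one (alpha, beta) pair."""
    K = Fraction(rho, r - 1)
    lhs = k_adder * Fraction(r, r**alpha) + K * (r - 2) * Fraction(1, r**beta)
    rhs = K - Fraction(1, 2) - 2 * (1 + K) * Fraction(r, r**delta)
    return lhs <= rhs


def near_minimal_alpha_beta(r, rho, delta, k_adder, alpha_max=64):
    """Return (alpha, beta) favouring small alpha, or None if none is found."""
    for alpha in range(1, alpha_max + 1):
        # Give alpha its best chance by allowing beta its maximum value, delta.
        if satisfies_precision(r, rho, delta, k_adder, alpha, delta):
            beta = delta
            # Then reduce beta while the inequality still holds.
            while beta > 1 and satisfies_precision(r, rho, delta, k_adder, alpha, beta - 1):
                beta -= 1
            return alpha, beta
    return None
```

With r = 4, ρ = 2, δ = 4 and an assumed adder coefficient K' = 2/3, the search yields α = 3 and β = 3. With r = 2, ρ = 1 and δ = 4 the right-hand side of (2.20) is zero, so no pair exists and the sketch returns `None`, which illustrates why a δ larger than the minimum can be needed.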
Definition of the Steps

Once "good" minimum values for $\alpha$ and $\beta$ have been chosen for a given
set of constants ($r$, $K$, $\delta$, and $K'$), steps can be defined in each overlap
region.   The value of $\alpha$ limits the maximum precision allowed in the specification
of the treads and $\beta$ the maximum precision allowed in the specification of
the risers.   Recall that the treads and risers must conform to the upper and
lower limits, UPPER($i-1$) and LL($i$), in each overlap region.

Think of the overlap region between $q_{j+1} = i$ and $q_{j+1} = i-1$ as a
grid of vertical spacings, $\Delta d$, and horizontal spacings, $\Delta rp$.   The set of all
boundaries in this overlap region is all stairsteps which can be drawn along
these grids while remaining inside the upper and lower limits.   See Figure 2.7.
As $\Delta d$ and $\Delta rp$ are decreased (i.e., $\alpha$ and $\beta$ are increased) the number of
different possible boundaries increases exponentially.   The boundary,
stairstep, which results in the simplest and fewest comparisons for the
selection of $q_{j+1}$ should be chosen.   Some possible choices are shown in
Figure 2.8.   These comparison constants are then used to define the quotient
digit selection function, SELECT, of algorithm DIVIDE.

The flowchart for the program used to choose a "good" stairstep in
each overlap region is given in Figure 2.8.   Little or no attempt was made
to minimize the comparison constants of one overlap region in relation to
another.   See the work of Atkins [ATK67, ATK68, ATK70] for a more detailed
analysis.

Appendix II contains some sample P-D plots with stairsteps and
examples of the algorithm DIVIDE corresponding to them.  The steps were
chosen according to the algorithm defined in the flowchart of Figure 2.8.

Figure 2.7   Selecting a "Good" Staircase

Figure 2.8   Step Definition Flowchart

The binary values of the steps (i.e., the comparison constants) are given in
the righthand ($r\tilde{P}_j$) and top ($\tilde{D}$) margins of the plots shown in Appendix II.

2.4  Valid  Operand  Ranges 

By investigating the P-D plots, it becomes obvious that the initial
operands must be restricted in the range of values they can assume.   To
insure the convergence of algorithm DIVIDE, equation (2.5) must be satisfied
for $j = 0$.   This may require an initial preshifting of the dividend.   As
stated in Section 2.2, both operands are assumed to be in normalized form
upon input to the arithmetic unit; i.e.,

$$\tfrac{1}{2} \le D, N < 1.$$

As seen by looking at the plots, only $D$ is required to be in normalized form
for the division algorithm to be defined.   The allowable range for $N$ must
now be determined.

When the $\delta$ most significant digits of the dividend are shifted to
become the first shifted partial remainder, $r\hat{P}_0$, prior to quotient digit
selection, it is conceivable that this shifting results in an $r\hat{P}_0$ which is
out of range.   That is, it corresponds to a point on the P-D plot for which
no quotient digit value is defined.   In other words, if

$$r\hat{P}_0 > (K+\rho)D - (1+K)\,r^{-\delta+1}$$

then $r\hat{P}_0$ is out of range and must be scaled before division can proceed.
For $r\hat{P}_0$ to be valid it must conform to the bounds

$$(-K-\rho)D + (1+K)\,r^{-\delta+1} \le r\hat{P}_0 \le (K+\rho)D - (1+K)\,r^{-\delta+1}.$$

Looking at the worst case values on the upper bound ($D = \tfrac{1}{2}$) and assuming minimal
redundancy ($\rho = r/2$, so that $K = r/(2(r-1))$), this implies that

$$r\hat{P}_0 \le \Big(\frac{r}{2(r-1)} + \frac{r}{2}\Big)\frac{1}{2} - \Big(1 + \frac{r}{2(r-1)}\Big) r^{-\delta+1}$$

or just

$$r\hat{P}_0 \le \frac{r^2}{4(r-1)} - \frac{3r-2}{2(r-1)}\,r^{-\delta+1}.$$

If the term involving $\delta$ is assumed to be negligible with respect to the $r^2$ term,
then $\hat{P}_0$ must fall below the value $r/(4(r-1))$, where

$$\frac{1}{4} < \frac{r}{4(r-1)} \le \frac{1}{2}, \qquad r = 2, 3, \ldots$$

Since $\hat{P}_0$ is input in normalized form, shifting $\hat{P}_0$ one bit to the right will
guarantee that $r\hat{P}_0$ will be within the allowable range limits.   A correction
on the quotient which consists of shifting $R$ one bit to the left must then
be made.


2.5  Hardware  Block  Diagram 

This  chapter  has  completely  defined  the  algorithm  DIVIDE.   Specifically, 
the requirements on the index difference, $\delta$, and the quotient digit selection
function,  SELECT,  have  been  given.   In  this  section  a  variable  radix  block 
diagram  implementing  the  DIVIDE  algorithm  will  be  examined.   At  this  point, 
only  the  major  components  of  the  arithmetic  unit  (AU)  for  DIVIDE  will  be 
discussed.   Lower  level  details  for  actual  implementation  will  be  developed 
in  Chapter  5. 

Figure 2.9 is a block diagram of the AU for performing division.
It is so structured as to be compatible with multiplication, addition, and
subtraction with only minor modifications.   The major component of the AU is
a full width† multi-input limited carry-borrow propagation adder.   The
adder is discussed in detail in Chapter 5.   In many practical applications
the number of inputs to the adder is rather small.

The quotient digit selector is a table look up device which implements
the SELECT function.   It examines the $\alpha$ most significant digits of $r\hat{P}_j$, denoted $r\tilde{P}_j$, and
the $\beta$ most significant digits of $D_j$, denoted $\tilde{D}$, in order to select the appropriate
quotient digit, $q_{j+1}$.

The rest of the unit consists mainly of registers and selectors.   Two
full width double bank registers are required for the storage of the quotient,
$R$, and the partial remainder, $\hat{P}_j$, because they are in redundant form.   The
selection network must be capable of forming the required multiples of $D_j$ and
$R_{j-1}$.   A carry generator would be needed if, for example, a radix complement
representation of negative numbers is used.   In that case, the selection
network must also be able to form the complement of the possible multiples
of $D_j$.

The  complexity  of  the  selection  network  increases  for  higher  radices, 
and  since  the  additional  multiples  appear  as  inputs  to  the  adder,  the 
complexity  of  the  adder  would  also  be  increased.   Thus,  a  higher  radix,  while 
reducing  the  number  of  steps  per  cycle,  does  increase  the  complexity  of  the 
arithmetic  unit.   Chapter  5  addresses  the  problem  of  finding  an  optimal  radix 
while considering both the complexity of the adder and selection network and
the  complexity  of  the  result  digit  selector  across  all  of  the  operations 
(/,  *,  +,  and  -). 


† The term "full width" implies that the adder can process a full precision
operand (i.e., several memory words) during one cycle and the registers can
store one full precision operand each.   Thus, the adder and register widths
set a hardware upper limit on the maximum allowed precision.

3.   MULTIPLICATION

Once the on-line algorithm for division has been specified, compatible
algorithms for multiplication, addition, and subtraction can be
defined.   This  chapter  presents  an  on-line  multiplication  algorithm  which 
has  its  roots  in  work  done  by  Ercegovac  [ERC75,  TRI75].   It  combines  the 
well-known  technique  of  incremented  multiplication,  as  used  in  digital 
differential  analyzers  [BRA63,  CAM70,  BAK75],  with  the  use  of  redundant 
number  systems.   The  algorithm  is  so  structured  as  to  be  compatible  with 
the  division  algorithm  specified  in  Chapter  2. 

3.1  Background 

In multiplication, a product is accumulated by the successive
addition of multiples of the multiplicand to a partial product.   Unlike
division, the selection of which multiple to add is dependent upon a known
quantity (i.e., a digit of the multiplier).   Thus, multiplication can be
defined by the recursion relationship†

$$P_j \leftarrow rP_{j-1} + y_j X, \qquad j = 1,\ldots,m \tag{3.1}$$

in which

$P_0$  is zero,
$P_j$  is the partial product used in the $j$th recursion,
$P_m$  is the product,
$j$    is the recursion index,
$y_j$  is the $j$th multiplier digit,
$X$    is the multiplicand, and
$r$    is the radix.

† Recall that operations proceed from most-to-least significant digit as
required for division.


To  form  the  new  partial  product,  P.,  a  multiple  of  the  multiplicand 
is  added  to  the  previous  shifted  partial  product.   Exactly  which  multiple 
to  add  is  dependent  upon  a  known  multiplier  digit.   Thus,  many  of  the 
problems  encountered  in  division  are  not  present  in  the  design  of  the 
multiplication  algorithm. 

In converting multiplication to an on-line process, two complications
arise.   First, the recursion relationship must be restructured to
take into account the on-line nature of the operands.   During the $j$th
recursion, only those operand digits which have been received prior to the
$j$th iteration can be included in the calculation.

Secondly,  if  a  nonredundant  number  system  is  used  in  representing 
the  partial  product,  the  digits  of  the  desired  product  appear  in  a  right-to- 
left  (least-to-most  significant)  fashion,  as  determined  by  the  conventional 
carry  propagation  requirements.   If,  however,  redundancy  is  used  in  the 
representation  of  the  product,  the  desired  on-line,  most-to-least  significant 
generation  of  the  product  digits  can  be  provided.   Here,  again,  the  redundant 
product  must  eventually  be  converted  to  a  conventional  representation.   See 
Appendix  III  for  a  description  of  the  on-line  recoding  algorithm. 

This chapter will develop in detail the methods used to alleviate
these complications and specify a compatible on-line multiplication algorithm
and the conditions on it.

3.2  The On-line Algorithm

Let the radix $r$ representation of the fractional part of the positive
multiplicand, multiplier, and product be denoted by $X$, $Y$, and $R$ respectively,
such that

$$X = \sum_{i=1}^{m} x_i r^{-i},$$

$$Y = \sum_{i=1}^{m} y_i r^{-i},$$

$$R = \sum_{i=1}^{m} p_i r^{-i},$$

and $R = X \cdot Y$

to $m$ digit precision.

Recall that in an on-line environment the digits of the multiplicand
and multiplier are not known in advance, but are available on-line, digit-by-digit,
most significant digit first.   The operand digits, $x_i$ and $y_i$, are
typically members of a conventional, nonredundant digit set, $D$, such that

$$x_i, y_i \in \{0, 1, 2, \ldots, r-1\}.$$

It is assumed that the multiplicand and multiplier are in normalized form;
i.e.,

$$\tfrac{1}{2} \le X, Y < 1.$$

Define the $j$-digit representation of the on-line operands $X$ and
$Y$ as

$$X_j = \sum_{i=1}^{j} x_i r^{-i} = X_{j-1} + x_j r^{-j}$$

where $X_0 = 0$, and

$$Y_j = \sum_{i=1}^{j} y_i r^{-i} = Y_{j-1} + y_j r^{-j}$$

where $Y_0 = 0$.   The corresponding partial product is, then,

$$X_j Y_j \leftarrow X_{j-1} Y_{j-1} + (X_{j-1} y_j + x_j y_j r^{-j} + Y_{j-1} x_j)\,r^{-j}$$

which can be rewritten as

$$X_j Y_j \leftarrow X_{j-1} Y_{j-1} + (X_j y_j + Y_{j-1} x_j)\,r^{-j}.$$

Therefore, if $P_j$ is the scaled partial product,

$$P_j = X_j Y_j r^{j} \tag{3.2}$$

a recursion relationship for multiplication which takes into account
the on-line nature of the operands can be expressed as

$$P_j \leftarrow rP_{j-1} + X_j y_j + Y_{j-1} x_j. \tag{3.3}$$

But, this relationship does not, as it stands, generate the result digits in
an on-line fashion.   The product digits would become available from the
least-to-most significant end of $P_m$, as determined by the traditional carry
propagation requirements.
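
Recursion (3.3) can be exercised directly. The sketch below is an illustration rather than part of the thesis: it feeds the operand digits most significant digit first, updates $X_j$, $Y_j$ and $P_j$ with exact rationals, and records each step so that the scaling relation (3.2) can be checked. The function name and the returned tuple layout are assumptions.

```python
from fractions import Fraction


def online_partial_products(x_digits, y_digits, r):
    """Apply recursion (3.3) digit by digit; return a list of (P_j, X_j, Y_j)."""
    X = Y = P = Fraction(0)                 # X_0 = Y_0 = P_0 = 0
    steps = []
    for j, (x, y) in enumerate(zip(x_digits, y_digits), start=1):
        Y_prev = Y                          # Y_{j-1}, needed by the recursion
        X += Fraction(x, r**j)              # X_j = X_{j-1} + x_j r^-j
        Y += Fraction(y, r**j)              # Y_j = Y_{j-1} + y_j r^-j
        P = r * P + X * y + Y_prev * x      # recursion (3.3)
        steps.append((P, X, Y))
    return steps
```

For any digit strings, each recorded $P_j$ equals $X_j Y_j r^j$, which is the content of equation (3.2).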

Assume that the product digits, $p_i$, can be selected on-line using a
recursion relationship similar to (3.3).   One new digit of the product could
then be determined upon the receipt of one new digit each of the multiplicand
and multiplier.   In multiplication the index difference, $\delta$, is identically 1.
That is, only one digit of each of the operands is needed initially to
select the first result digit.   Let the product digits, $p_i$, be members of
the same symmetric redundant digit set, $D_\rho$, as defined for the quotient
digits in Chapter 2.

The partial product is computed via the same limited carry-borrow
propagation adder used to generate the partial remainder during division.
Thus, the digits of the partial product are members of the redundant digit
set, $D_{\rho'}$.

Given these definitions, the algorithm MULT [TRI75], which is shown
below, can be specified.   In this algorithm, recursion (3.3),
which allows for on-line processing of operands, has been altered to form
the basic recursion, (3.4), providing on-line generation of result digits.
The selection procedure for result digits, as shown in the next section,
corresponds to a simple rounding on $\hat{P}_j$.   Thus, the basic recursion includes
a "correction" on $\hat{P}_{j-1}$ to take into account the previous selection of $p_{j-1}$.

From equation (3.3) and the basic recursion equation, (3.4), in
algorithm MULT, the following relation can be obtained by induction. Here
$P^{*}_j$ denotes the uncorrected scaled partial product of (3.2) and (3.3), and
$P_j$ the partial product generated by the basic recursion (3.4).

$$P_j = P^{*}_j - \sum_{i=1}^{j-1} p_i r^{j-i}$$

This implies that, as $j \to m$,

$$P_m = P^{*}_m - \sum_{i=1}^{m-1} p_i r^{m-i}$$

or rearranging

$$P^{*}_m = P_m + r^{m} \sum_{i=1}^{m-1} p_i r^{-i} . \tag{3.5}$$

Since, by equation (3.2),

$$P^{*}_m = X_m \cdot Y_m \cdot r^{m} = X \cdot Y \cdot r^{m} ,$$

equation (3.5) can be rewritten as

$$X \cdot Y \cdot r^{m} = P_m + r^{m} \sum_{i=1}^{m-1} p_i r^{-i}$$

which is just

$$X \cdot Y = \sum_{i=1}^{m} p_i r^{-i} + (P_m - p_m) r^{-m} ,$$

so that

$$R = \sum_{i=1}^{m} p_i r^{-i} = X \cdot Y - (P_m - p_m) r^{-m} .$$

Algorithm MULT

Step 1  [Initialization]:   $P_0 \leftarrow 0$;  $X_0 \leftarrow 0$;  $Y_0 \leftarrow 0$;
                            $R_0 \leftarrow 0$;  $p_0 \leftarrow 0$;
                            $j \leftarrow 1$;

Step 2  [Input Digit]:      $X_j \leftarrow X_{j-1} + x_j r^{-j}$;
                            $Y_j \leftarrow Y_{j-1} + y_j r^{-j}$;

Step 3  [Basic Recursion]:  $P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$          (3.4)

Step 4  [Selection]:        $p_j \leftarrow \mathrm{SELECT}(P_j)$;
                            $R_j \leftarrow R_{j-1} + p_j r^{-j}$;

Step 5  [Test]:             IF (j < m)
                              THEN $j \leftarrow j + 1$;
                                   GO TO Step 2;
                              ELSE END MULT;

By devising a product digit selection procedure, SELECT, in Step 4
of algorithm MULT, such that

$$|P_m - p_m| \le K \tag{3.6}$$

where the redundancy coefficient, K, satisfies

$$\tfrac{1}{2} < K \le 1 ,$$

then $R = X \cdot Y$ can be computed to m digit precision. In the next section
such a selection procedure will be derived so as to guarantee convergence of
the algorithm.

The algorithm as it stands produces just the most significant half
of the product. The least significant half of the product is available as
the redundant output of the adder after iteration m + 1; i.e.,

$$P_{m+1} = r(P_m - p_m) .$$

By feeding these redundant adder digits directly into the recoding unit, the
least significant half of the product can also be output in conventional
form.

3.3  Product  Digit  Selection 

In order to implement the algorithm MULT, a product digit selection
procedure, SELECT of Step 4, must be devised such that restriction (3.6) is
satisfied. For j = 1, the range restriction can be satisfied by appropriately
preshifting the operands as explained in Section 3.4. Assuming that
there is a selection procedure that generates the product digit $p_j$ so as
to guarantee that

$$|P_j - p_j| \le K$$

given that

$$|P_{j-1} - p_{j-1}| \le K ,$$

then by induction the range restriction (3.6), assuming that the operands
conform to certain bounds as derived in Section 3.4, will hold for all
values of j. Obviously, by performing a "simple" rounding on $P_j$ and using
the integer part of this rounded $P_j$ to represent the product digit, $p_j$,

$$|P_j - p_j| \le K .$$

The remainder of this section will fully specify an implementable rounding
procedure for the selection of the product digits.

3.3.1  The  Selection  Function 

Define the product digit selection function [TRI75] to be

$$p_j \leftarrow \mathrm{SELECT}(P_j) =
\begin{cases}
\operatorname{sign} P_j \cdot \left\lfloor |P_j| + \tfrac{1}{2} \right\rfloor & \text{for } |P_j| \le \rho , \\
\operatorname{sign} P_j \cdot \left\lfloor |P_j| \right\rfloor & \text{otherwise.}
\end{cases} \tag{3.7}$$

This represents, for all practical purposes, a rounding procedure which has
been modified at the end points of the domain to avoid product digit values
greater than $\rho$. Thus, the selection process itself can be carried out in
a deterministic fashion; that is, the product digit selected by the procedure
is simply the integer part of the rounded partial product, $P_j$.
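
The rounding rule (3.7) can be written out directly. The following is a minimal Python sketch, not the table look-up realization discussed later: it assumes the partial product is available as an exact rational value (full precision inspection), and the function name `select` and its arguments are illustrative choices rather than names taken from the thesis.

```python
from fractions import Fraction
from math import floor


def select(P, rho):
    """Selection function (3.7), evaluated on an exact (full precision) value.

    Rounds |P| to the nearest integer (halves rounded away from zero) when
    |P| <= rho, and truncates |P| otherwise, then restores the sign of P.
    """
    P = Fraction(P)
    sign = (P > 0) - (P < 0)          # +1, 0 or -1
    magnitude = abs(P)
    if magnitude <= rho:
        return sign * floor(magnitude + Fraction(1, 2))
    return sign * floor(magnitude)


# A few values taken from the worked examples of this chapter.
print(select(Fraction(5, 8), 1))     # P = 0.1010 (binary), rho = 1
print(select(Fraction(-5, 8), 1))    # P = -0.101 (binary), rho = 1
print(select(Fraction(21, 8), 3))    # P = 10.1010 (binary), rho = 3
```

Exact rationals are used so that the comparison against $\rho$ and the half-unit rounding are not disturbed by floating-point error.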

However, the partial product is in redundant form. Thus, a "simple"
rounding would require a full precision carry propagation of $P_j$ in order to
determine its magnitude. This must be avoided if at all possible. An
ideal selection procedure is one in which the time necessary to perform
the selection is independent of the precision of the operands
(i.e., step invariant).

By devising a graphical representation for the selection process,
the problem can be more easily understood and solved. Figure 3.1 is a plot
of the partial product, $P_j$, versus the remaining partial product after
rounding, $P_j - p_j$, and will be designated a P-p plot. By analyzing such a
plot, a product digit selection procedure based upon a limited number of
leading digits of $P_j$ can be fully specified for a given $r$ and $\rho$, resulting
in a precision independent selection.

In Figure 3.1 each line corresponding to a different product digit
value will be called a "p-line." The bound on the remaining partial product
after rounding,

$$|P_j - p_j| \le K ,$$

determines the maximum allowable value for the partial product†, $P_j$; that is,

$$|P_j| \le K + \rho = rK$$

in order for a product digit to be properly selected. Thus, the maximum
value for $P_j$ is $rK$, which occurs at the intersection of $(P_j - p_j) = K$ and the
p-line, $p_j = \rho$. Similarly, the minimum value, $-rK$, must occur at the
intersection of $(P_j - p_j) = -K$ and the p-line, $p_j = \bar{\rho}$. These bounds are
indicated by the dashed vertical lines on Figure 3.1.


† The range for the partial product, $P_j$, in multiplication ($-K - \rho \le P_j \le K + \rho$)
is approximately comparable in magnitude to the range of the shifted partial
remainder, $rP_j$, in division
($(-K - \rho)D + (1 + K)r^{-\delta+1} \le rP_j \le (K + \rho)D - (1 + K)r^{-\delta+1}$).

Figure 3.1  P-p Plot

By investigating the basic recursion, equation (3.4), of algorithm
MULT it becomes obvious that certain bounds must also be placed upon X and Y
in order for the new partial product, $P_{j+1}$, to be within the required limits
of $\pm(K + \rho)$. These operand range restrictions and their effects on the
selection function are covered in Section 3.4.

As in division, redundancy in the representation of the product
permits the selection of product digits to be based upon a limited number
of leading digits of $P_j$. This is manifested in Figure 3.1 by the overlap
region, $\Delta$, for which either of two p-lines may be legitimately selected.
For example, at point A one may move vertically upward to the p-line,
$p_j = 0$, or downward to the p-line, $p_j = 1$. In either case the product
digit is correct.

The defined selection function (3.7) implies that the comparison
constants for multiplication are simple, low precision numbers (i.e.,
$\pm\tfrac{1}{2}, \pm 1\tfrac{1}{2}, \pm 2\tfrac{1}{2}, \ldots, \pm(\rho - \tfrac{1}{2})$). Then the product digit, $p_j$, can be defined
according to

$$p_j =
\begin{cases}
i - 1 & (i-1) - \tfrac{1}{2} \le P_j < (i-1) + \tfrac{1}{2} \\
i & i - \tfrac{1}{2} \le P_j < i + \tfrac{1}{2} .
\end{cases}$$

Denote the lower and upper endpoints of the $i$-th p-line as $\ell_i$ and $u_i$
respectively, as shown in Figure 3.2. The overlap, $\Delta$, becomes

$$\Delta = u_{i-1} - \ell_i = (i - 1 + K) - (i - K)$$

for $\bar{\rho} \le i \le \rho$. From this plot the relationship between the overlap and the
redundancy coefficient is seen to be

$$\Delta = 2K - 1 .$$

Figure 3.2  The Overlap Region of the P-p Plot

As in the case of division, the width of the overlap region is
proportional to the amount of redundancy in the representation of the result.
As the redundancy is increased, the width of the overlap, $\Delta$, is increased
and the required precision of inspection of the partial product, $P_j$, for
selection is decreased. To fully define the algorithm MULT, sufficient
precision for selection of $p_j$ must now be determined.

3.3.2   Sufficient  Precision 

Assume that sufficient precision for the selection of product digits
corresponds to the use of $\gamma$ most significant digits of the partial product
$P_j$. Denote this truncated estimate to the partial product as $\hat{P}_j$. Unlike
in division, multiplication can proceed after only one digit of each operand
has been input, regardless of the value of $\gamma$. Recall that selection of the
first product digit, $p_1$, is based upon

$$P_1 = X_1 y_1$$

which must be guaranteed to be less than or equal to $(K + \rho)$ for the selection
to be valid. Since the selection is based only upon $\gamma$ most significant
digits of a computed value for $P_1$, selection can take place even though only
one digit of each operand has been input.

Denote the maximum error introduced by this truncation into the
representation of $P_j$ as $\Delta p$. Then $\Delta p$ is defined by

$$|P_j - \hat{P}_j| \le \Delta p$$

and

$$\Delta p \le K' r^{-\gamma+1} \tag{3.8}$$

where $K'$ is the redundancy coefficient of the adder.

The minimum allowable value for $\gamma$ can be determined by looking at
the overlap region, $\Delta$. See Figure 3.3. The precision implied by $\gamma$ must be
large enough to guarantee that the error incurred by truncating the least
significant portion of $P_j$ during the selection of $p_j$ does not result in an
incorrect product digit selection; that is,

$$\Delta p \le \frac{\Delta}{2} \tag{3.9}$$

This condition is depicted in Figure 3.3. The smallest $\hat{P}_j$ falling in the
overlap region and resulting in a product digit selection of $p_j = i$
(i.e., $\hat{P}_j = i - \tfrac{1}{2}$) must conform to the bound

$$|\hat{P}_j - P_j| \le \frac{\Delta}{2} .$$

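The minimum value of $\gamma$ can now be determined. From equations (3.8)
and (3.9), it is seen that

$$K' r^{-\gamma+1} \le \frac{\Delta}{2} = \frac{2K-1}{2} .$$

The lower bound on $\gamma$ becomes

$$\gamma \ge 1 + \left\lceil \log_r \frac{2K'}{2K-1} \right\rceil \tag{3.10}$$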
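
As a cross-check of (3.10), the smallest $\gamma$ can be found by searching directly for the first integer that satisfies $K' r^{-\gamma+1} \le (2K-1)/2$, which avoids evaluating a logarithm in floating point. The sketch below is illustrative only; the function name `min_gamma` and the argument name `K_adder` (standing for $K'$) are assumptions made for this example.

```python
from fractions import Fraction


def min_gamma(r, K, K_adder):
    """Smallest integer gamma >= 1 with K' * r**(-gamma + 1) <= (2K - 1) / 2.

    r        radix
    K        redundancy coefficient of the product digits (must exceed 1/2)
    K_adder  redundancy coefficient K' of the adder
    """
    K = Fraction(K)
    K_adder = Fraction(K_adder)
    if K <= Fraction(1, 2):
        raise ValueError("K must exceed 1/2, otherwise the overlap is empty")
    half_overlap = (2 * K - 1) / 2
    gamma = 1
    while K_adder / r ** (gamma - 1) > half_overlap:
        gamma += 1
    return gamma


# Maximum and minimum redundancy for a few radices.
for r, K in [(2, 1), (4, 1), (4, Fraction(2, 3)), (16, Fraction(8, 15))]:
    print(r, K, min_gamma(r, K, K))
```

For each of the cases listed, with $K' = K$, the search stops at $\gamma = 2$.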

In Table 3.1 the minimum values of $\gamma$ (the number of redundant digits of $P_j$)
and $\gamma'$ (the number of nonredundant bits derived from the $\gamma$ most significant
digits of $P_j$ on which a carry propagation has been performed) are shown as
a function of the base ($r$), product redundancy coefficient ($K$), and adder
redundancy coefficient ($K'$).

Figure 3.3  P-p Plot Showing the Worst Case Error

Table 3.1  Minimum $\gamma$ Values

$$\gamma = 1 + \left\lceil \log_r \frac{2K'}{2K-1} \right\rceil$$

BASE    K          K'         $\gamma$ DIGITS    $\gamma'$ BITS
2       1          1          2                  2
4       2/3        2/3        2                  4
4       1          1          2                  3
8       4/7        4/7        2                  6
8       1          1          2                  4
16      8/15       8/15       2                  8
16      1          1          2                  5
256     128/255    128/255    2                  16
256     1          1          2                  9

3.4  Valid Operand Ranges

As  stated  in  Section  3.3.1,  the  range  of  the  initial  operands  must 
be  restricted  to  insure  the  convergence  of  the  algorithm  MULT.   The  operands 
may  have  to  be  preshifted  initially  to  conform  to  the  specified  convergence 
bounds.   Recall  that  both  operands  are  assumed  to  be  in  normalized  form 
upon  input  to  the  arithmetic  unit;  i.e., 

$$\tfrac{1}{2} \le X, Y < 1 .$$

To be able to select the first product digit, $p_1$, the first partial
product, $P_1$, must conform to the bound

$$P_1 \le K + \rho .$$

From the basic recursion, (3.4), the bound on $X_1$ is seen to be

$$X_1 y_1 \le K + \rho = rK .$$

Assuming worst case values for $y_1$ and $K$ (namely $y_1 = r - 1$ and $K = \tfrac{1}{2}$),
this implies that

$$X_1 \le \frac{1}{2} \cdot \frac{r}{r-1}$$

and, since

$$\lim_{r \to \infty} \frac{1}{2} \cdot \frac{r}{r-1} = \frac{1}{2} ,$$

shifting X one bit to the right will guarantee that $P_1$ will be within the
allowed range limits for the selection of $p_1$.

To guarantee convergence of the algorithm MULT over all j, it is
obvious that certain bounds must also be placed upon X and Y in order for
each new partial product to be within the required limits of $\pm(K + \rho)$. To
insure convergence of the algorithm, given that

$$|P_{j-1} - p_{j-1}| \le K ,$$

the upper bound on the values of the operands must be small enough to
guarantee that

$$|P_j| \le rK$$

to insure valid selection of $p_j$. From the basic recursion of the algorithm
MULT, equation (3.4),

$$P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j ,$$

worst case values imply that

$$P_j \le r(P_{j-1} - p_{j-1}) + X(r-1) + Y(r-1) ,$$

or just

$$P_j \le r(P_{j-1} - p_{j-1}) + (X + Y)(r-1) .$$

Let the upper bound on both operands be represented by M. The
largest possible value for M, resulting in the fewest possible preshifting
steps (scaling) on the initial operands, should be found. By using the
upper bound M, the above equation becomes

$$P_j \le r(P_{j-1} - p_{j-1}) + 2M(r-1) .$$

Replacing $P_j$ by its upper bound for selection, this becomes

$$r(P_{j-1} - p_{j-1}) + 2M(r-1) \le rK$$

or, rearranging

$$M \le \frac{r\,\bigl(K - (P_{j-1} - p_{j-1})\bigr)}{2(r-1)} . \tag{3.11}$$

If the smallest possible number of leading digits of $P_{j-1}$ was used for the
selection of $p_{j-1}$, then the upper bound

$$(P_{j-1} - p_{j-1}) \le K$$

must be used in equation (3.11) and the bound on the operands goes to zero!
But, by inspecting more digits than the absolute minimum for selection, the
bound on $(P_{j-1} - p_{j-1})$ can be tightened considerably. A full precision
inspection results in the tightest possible upper bound of $\tfrac{1}{2}$. Thus, for
full precision inspection the bound, M, is

$$M \le \frac{r(2K-1)}{4(r-1)} .$$

Assuming maximum redundancy (K=1),

$$\lim_{r \to \infty} \frac{r}{4(r-1)} = \frac{1}{4} ,$$

so that shifting both X and Y two bits to the right will guarantee convergence
for the case of full precision inspection.
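
The bound (3.11) is easy to tabulate for particular radices. The sketch below evaluates it exactly; the function name `operand_bound` and the keyword `residual_bound` (the assumed upper bound on $P_{j-1} - p_{j-1}$, defaulting to the full precision value of 1/2) are illustrative names, not notation from the thesis.

```python
from fractions import Fraction


def operand_bound(r, K, residual_bound=Fraction(1, 2)):
    """Upper bound M on both operands from (3.11).

    M <= r * (K - residual_bound) / (2 * (r - 1)), where residual_bound is the
    assumed upper bound on the residual P_{j-1} - p_{j-1}.
    """
    return Fraction(r) * (Fraction(K) - residual_bound) / (2 * (r - 1))


# Full precision inspection (residual bounded by 1/2).
print(operand_bound(2, 1))                 # radix 2, maximum redundancy
print(operand_bound(4, 1))                 # radix 4, maximum redundancy
print(operand_bound(4, Fraction(2, 3)))    # radix 4, minimum redundancy

# Minimum precision inspection (residual only bounded by K): the bound vanishes.
print(operand_bound(4, 1, residual_bound=1))
```

With K = 1 the bound equals $r/(4(r-1))$, which exceeds 1/4 for every finite radix; this is why operands below 1/4 (two binary places of preshift) are sufficient in that case.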

A  compromise  between  the  minimum  number  of  digits  inspected  for 
selection  and  the  smallest  amount  of  necessary  initial  scaling  must  be  made. 
As  will  be  discussed  in  Chapter  5,  more  digits  of  the  partial  remainder  are 
always  required  for  selection  during  division.   Since  these  lines  are 
available,  using  them  during  multiplication  does  not  increase  the  complexity 
of  the  arithmetic  unit,  while  it  does  relax  the  bound,  M,  on  the  operands. 

3.5  Hardware  Block  Diagram 

In this chapter the algorithm MULT has been completely defined.
Specifically, the product digit selection function, SELECT, and initial range
restrictions have been given. In this section a variable block diagram
implementing the MULT algorithm will be examined. Here, again, only the
major components of the arithmetic unit (AU) for MULT will be discussed.
Lower level details for actual implementation will be developed in Chapter 5.

Figure 3.4  Block Diagram for Multiplication

Figure 3.4 is a block diagram of the AU for performing multiplication.
It is so structured as to be compatible with division, addition, and
subtraction. As with division, the major component of the AU is a full
width multi-input limited carry-borrow propagation adder.

The product digit selector is a table look-up device which implements
the SELECT function for multiplication. It examines the $\gamma$ most significant
digits of $P_j$, $\hat{P}_j$, and does essentially a rounding on it to select the
proper product digit, $p_j$. As with division, the rest of the unit consists
of registers and selectors.

3.6  Some  Numerical  Examples 

Table 3.2 through Table 3.4 are examples of the algorithm MULT for
several different radices and product redundancy coefficients (K).
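
The iterations tabulated in this section can be reproduced with a short simulation of algorithm MULT. The following Python sketch is one possible rendering under stated assumptions: it uses exact rational arithmetic and a full precision SELECT per (3.7) rather than the truncated, table-driven selection of the hardware, and the names `select`, `online_mult`, and `digits_value` are illustrative.

```python
from fractions import Fraction
from math import floor


def select(P, rho):
    """Selection function (3.7) on an exact value of the partial product."""
    sign = (P > 0) - (P < 0)
    magnitude = abs(P)
    if magnitude <= rho:
        return sign * floor(magnitude + Fraction(1, 2))
    return sign * floor(magnitude)


def online_mult(x_digits, y_digits, r, rho):
    """Steps 1-5 of algorithm MULT; digit lists are most significant first.

    Returns the product digits p_1..p_m and the final partial product P_m.
    """
    X = Y = P = Fraction(0)
    p_prev = 0
    p_digits = []
    for j, (x, y) in enumerate(zip(x_digits, y_digits), start=1):
        Y_prev = Y
        X += Fraction(x, r ** j)                        # Step 2
        Y += Fraction(y, r ** j)
        P = r * (P - p_prev) + X * y + Y_prev * x       # Step 3, recursion (3.4)
        p_prev = select(P, rho)                         # Step 4
        p_digits.append(p_prev)
    return p_digits, P


def digits_value(digits, r):
    """Value of 0.d1 d2 ... in radix r (digits may be negative)."""
    return sum(Fraction(d, r ** i) for i, d in enumerate(digits, start=1))


# Operands of Table 3.2: X = 0.01101001, Y = 0.01110011, r = 2, rho = 1.
x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 1, 1, 0, 0, 1, 1]
p, P_m = online_mult(x, y, r=2, rho=1)
m = len(x)

# Identity derived from (3.5): X*Y = sum p_i r^-i + (P_m - p_m) r^-m.
exact = digits_value(x, 2) * digits_value(y, 2)
assert digits_value(p, 2) + (P_m - p[-1]) / 2 ** m == exact
print(p, P_m)
```

The final assertion checks the relation between the selected digits, the last residual, and the exact product for this operand pair only; it is a spot check, not a proof of convergence.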


Table 3.2  Example of the Algorithm MULT
(r = 2, K = 1)

r = 2,  m = 8,  $D_\rho = \{\bar{1}, 0, 1\}$,  K = 1

X = 0.01101001
Y = 0.01110011

(R = 0.0010111100101011)

j    $X_j y_j + Y_{j-1} x_j$    $P_j$          $p_j$        $2(P_j - p_j)$
1    0.0                        0.0            0            0.0
2    0.01                       0.01           0            2(0.01 - 0) = 0.1
3    0.101                      1.001          1            2(1.001 - 1) = 0.01
4    0.0110                     0.1010         1            2(0.1010 - 1) = -0.11
5    0.0111                     -0.0101        0            2(-0.0101 - 0) = -0.101
6    0.0                        -0.101         $\bar{1}$    2(-0.101 + 1) = 0.11
7    0.0110100                  1.0010100      1            2(1.0010100 - 1) = 0.010100
8    0.11011011                 1.00101011     1            2(1.00101011 - 1) = 0.0101011

$$R = \sum_{j=1}^{8} p_j 2^{-j} + 2^{-8}(P_8 - p_8) = 0.00110\bar{1}11 + 2^{-9}(0.0101011) = 0.0010111100101011$$

Table 3.3  Example of the Algorithm MULT

(r = 4, K = 1)

r = 4,  m = 4,  $D_\rho = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$,  K = 1

X = 0.01101001
Y = 0.01110011

(R = 0.0010111100101011)

j    $X_j y_j + Y_{j-1} x_j$                      $P_j$           $p_j$        $4(P_j - p_j)$
1    0.01                                         0.01            0            4(0.01 - 0) = 1.0
2    3(0.0110) + 2(0.01) = 1.1010                 10.1010         3            4(10.1010 - 3) = -1.1
3    2(0.0111) = 0.1110                           -0.101          $\bar{1}$    4(-0.101 + 1) = 1.1
4    3(0.01101001) + 0.011100 = 1.10101011        11.00101011     3            4(11.00101011 - 3) = 0.101011

$$R = \sum_{j=1}^{4} p_j 4^{-j} + 4^{-4}(P_4 - p_4) = 0.03\bar{1}3_4 + 4^{-5}(0.223_4)$$

$$R = 0.02330223_4 = 0.0010111100101011$$
Table 3.4  Example of the Algorithm MULT
(r = 4, K = 2/3)

r = 4,  m = 4,  $D_\rho = \{\bar{2}, \bar{1}, 0, 1, 2\}$,  K = 2/3

X = 0.01101001
Y = 0.01110011

TO INSURE CONVERGENCE, SHIFT RIGHT ONE DIGIT EACH

(R = 0.0010111100101011)

j    $X_j y_j + Y_{j-1} x_j$                          $P_j$             $p_j$        $4(P_j - p_j)$
1    0.0                                              0.0               0            0.0
2    0.0001                                           0.0001            0            4(0.0001 - 0) = 0.01
3    3(0.000110) + 2(0.0001) = 0.011010               0.101010          1            4(0.101010 - 1) = -1.011
4    2(0.000111) = 0.001110                           -1.00101          $\bar{1}$    4(-1.00101 + 1) = -0.101
5    3(0.0001101001) + 0.00011100 = 0.0110101011      -0.0011010101     0            4(-0.0011010101) = -0.11010101

$$R = 4^{2}\left(\sum_{j=1}^{5} p_j 4^{-j} + 4^{-5}(P_5 - p_5)\right) = 4^{2}\left(0.001\bar{1}0_4 - 0.0000003111_4\right)$$

$$R = 0.010\bar{1}0000\bar{1}\bar{1}0\bar{1}0\bar{1}0\bar{1} = 0.0010111100101011$$

4.   ADDITION  AND  SUBTRACTION

Once  compatible  on-line  algorithms  for  division  and  multiplication 
have  been  specified,  it  remains  to  define  compatible  on-line  addition 
and  subtraction  algorithms.   It  would  be  possible,  since  a  limited  carry- 
borrow  propagation  adder  is  used  for  the  basic  recursion,  to  simply  feed 
the  output  of  the  adder  directly  to  the  recoding  logic.   Instead,  an 
algorithm  which  is  quite  similar  to  the  multiplication  algorithm  is 
presented.   It  will  soon  become  apparent  why  this  method  is  preferred. 

4.1  Background 

Consider the situation at the $i$-th digit position of a conventional
adder. The addend A and the augend B provide the inputs $a_i$ and $b_i$, each
of weight $r^{-i}$, to the $i$-th stage. A third input $c_i$, also of weight $r^{-i}$,
is the carry output from digit position $i+1$. The outputs of the $i$-th
position are the sum digit $s_i$ of weight $r^{-i}$ and the carry out digit $c_{i-1}$,
of weight $r^{-i+1}$. The relationship between outputs and inputs is, then

$$r c_{i-1} + s_i = a_i + b_i + c_i .$$

Because the carry out $c_{i-1}$ is propagated from the least significant to the
most significant end of the adder, the speed of conventional addition is
less than desirable. It has long been recognized that carries do not
need to be propagated during each addition of a long sequence of additions,
provided that the carries are explicitly stored. This technique was first
used to speed up multiplication, with a single carry propagation at the
end of the operation.

Storing the carry, $c_{i-1}$, is a method of introducing redundancy
into the adder. Much work has been done on the development of redundant
adders [ROH67, BOR68, MEL72, GOY76]. In particular, Rohatsch's work
proves that the utilization of any redundant and contiguous sum-difference
digit set makes possible the implementation of limited carry-borrow
addition and subtraction; that is, addition and subtraction for which
carries-borrows propagate no further than a fixed number (one or
two in practice) of digital positions is possible. In Goyal's work an
exhaustive study is done of various design techniques and the reader is
referred to this work for the gory details.

Since  the  carry-borrows  can  be  limited  to  one  or  two  digital 
positions,  a  redundant  adder  could  be  used  directly  for  on-line  addition 
with  an  on-line  delay  of  up  to  two  digits.   There  is,  however,  an 
algorithm  which  involves  no  delay  for  producing  on-line  sums  and  differ- 
ences.  This  algorithm  will  now  be  presented. 

4.2  The  On-line  Algorithm 

Let the radix r representation of the fractional part of the
positive addend (minuend), augend (subtrahend), and sum (difference) be
denoted by A, B, and R respectively, such that

$$A = \sum_{i=1}^{m} a_i r^{-i} ,$$

$$B = \sum_{i=1}^{m} b_i r^{-i} ,$$

$$R = \sum_{i=1}^{m} s_i r^{-i+1}$$

and R = A+B (R = A-B) to m digit precision.

In an on-line environment the digits of the addend (minuend) and
augend (subtrahend) are not known in advance, but are available on-line,
digit-by-digit, most significant digit first. The operand digits are
typically members of a conventional nonredundant digit set, $D$, such that

$$a_i, b_i \in \{0, 1, 2, \ldots, (r-1)\} .$$

It  is  assumed  that  the  operands  are  in  normalized  form;  i.e., 

$$\tfrac{1}{2} \le A, B < 1 .$$

Assume that the sum (difference) digits can be selected on-line.
One new digit of the sum (difference) would then be determined upon the
receipt of one new digit each of the addend (minuend) and augend
(subtrahend). In addition (subtraction) the index difference, $\delta$, is identically
1. That is, only one digit of each of the operands is needed initially
to select the first result digit. Let the sum (difference) digits, the $s_j$'s,
be members of the same symmetric redundant digit set, $D_{\rho}$, as defined for
the quotient and product digits.

The partial sum (difference) is computed via the same limited
carry-borrow propagation adder used to generate the partial remainder during
division and the partial product during multiplication. Thus, the digits
of the partial sum (difference) are members of the redundant digit set, $D_{\rho'}$.

Given these definitions, the algorithm ADD and the algorithm SUBT,
as shown below, can be defined in the same manner as the
algorithm MULT was defined for multiplication. The selection procedure
for result digits corresponds to a simple rounding on $P_j$, exactly as done
in multiplication. Thus, again, the basic recursion includes a "correction"
on $P_{j-1}$ to take into account the previous selection of $s_{j-1}$.

As in multiplication, a recursion relationship for addition which
takes into account only the on-line nature of the operands can be expressed
as

$$P_j \leftarrow r P_{j-1} + (a_j + b_j) r^{-1} \tag{4.3}$$

Algorithm ADD:

Step 1  [Initialization]:   $P_0 \leftarrow 0$;
                            $s_0 \leftarrow 0$;
                            $j \leftarrow 1$;
                            $R_0 \leftarrow 0$;

Step 2  [Input Digits]:     $a_j$ AND $b_j$;

Step 3  [Basic Recursion]:  $P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j + b_j) r^{-1}$          (4.1)

Step 4  [Selection]:        $s_j \leftarrow \mathrm{SELECT}(P_j)$;
                            $R_j \leftarrow R_{j-1} + s_j r^{-j+1}$;

Step 5  [Test]:             IF (j < m)
                              THEN $j \leftarrow j + 1$;
                                   GO TO Step 2;
                              ELSE END ADD;

Algorithm SUBT:

Step 1  [Initialization]:   $P_0 \leftarrow 0$;
                            $s_0 \leftarrow 0$;
                            $R_0 \leftarrow 0$;
                            $j \leftarrow 1$;

Step 2  [Input Digits]:     $a_j$ AND $b_j$;

Step 3  [Basic Recursion]:  $P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j - b_j) r^{-1}$          (4.2)

Step 4  [Selection]:        $s_j \leftarrow \mathrm{SELECT}(P_j)$;
                            $R_j \leftarrow R_{j-1} + s_j r^{-j+1}$;

Step 5  [Test]:             IF (j < m)
                              THEN $j \leftarrow j + 1$;
                                   GO TO Step 2;
                              ELSE END SUBT;

Using recursion (4.3), the sum digits would become available from the
least-to-most significant end of $P_m$, as determined by the traditional
carry propagation requirements. From equation (4.3) and the basic recursion
equation in algorithm ADD, (4.1), the following relation for addition can
be obtained by induction. Here $P^{*}_j$ denotes the uncorrected partial sum of
(4.3), and $P_j$ the partial sum generated by the basic recursion (4.1).

$$P_j = P^{*}_j - \sum_{i=1}^{j-1} s_i r^{j-i}$$

This implies that, as $j \to m$,

$$P_m = P^{*}_m - \sum_{i=1}^{m-1} s_i r^{m-i}$$

or rearranging

$$P^{*}_m = P_m + r^{m-1} \sum_{i=1}^{m-1} s_i r^{-i+1} . \tag{4.4}$$

Since $P^{*}_m = (A+B) r^{m-1}$ for addition and $P^{*}_m = (A-B) r^{m-1}$ for subtraction,
equation (4.4) can be rewritten as

$$(A \pm B) r^{m-1} = P_m + r^{m-1} \sum_{i=1}^{m-1} s_i r^{-i+1}$$

which is just

$$A \pm B = \sum_{i=1}^{m} s_i r^{-i+1} + (P_m - s_m) r^{-m+1} ,$$

so that

$$R = \sum_{i=1}^{m} s_i r^{-i+1} = (A \pm B) - (P_m - s_m) r^{-m+1} .$$

By using the product digit selection procedure SELECT defined for
multiplication in Chapter 3, Section 3.3,

$$|P_m - s_m| \le K \tag{4.5}$$

and R = A+B (R = A-B) can be computed to m digit precision.

4.3  Sum (Difference) Digit Selection

Using the same selection function as defined for multiplication,

$$s_j \leftarrow \mathrm{SELECT}(P_j) =
\begin{cases}
\operatorname{sign} P_j \cdot \left\lfloor |P_j| + \tfrac{1}{2} \right\rfloor & \text{for } |P_j| \le \rho , \\
\operatorname{sign} P_j \cdot \left\lfloor |P_j| \right\rfloor & \text{otherwise,}
\end{cases} \tag{4.6}$$

will insure the convergence of the algorithms ADD and SUBT. This represents
a simple rounding procedure on $P_j$. Recall, however, that the partial sum,
$P_j$, is in redundant form. Thus, a "simple" rounding would require a full
precision carry propagation of $P_j$ in order to determine its magnitude.
This must be avoided if at all possible.

Sufficient precision for the proper selection of sum digits can be
determined by looking at the basic recursion of the algorithm ADD, equation (4.1),

$$P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j + b_j) r^{-1} .$$

Using the same upper bound on $P_j$ for proper selection as defined for
multiplication,

$$|P_j| \le rK ,$$

and assuming worst case values for $a_j$ and $b_j$ of $(r-1)$, the basic recursion can
be restated as

$$r(P_{j-1} - s_{j-1}) + 2(r-1) r^{-1} \le rK .$$

The bound on the 'residual', $P_{j-1} - s_{j-1}$, is then

$$(P_{j-1} - s_{j-1}) \le K - \frac{2(r-1)}{r^{2}} . \tag{4.7}$$

For base 2 (K=1), this is just

$$(P_{j-1} - s_{j-1}) \le 1 - \frac{1}{2} = \frac{1}{2} .$$

So, for base 2 a full precision inspection of $P_j$ to select $s_j$ must be
performed. For base 4 (K=1), equation (4.7) becomes

$$(P_{j-1} - s_{j-1}) \le 1 - \frac{6}{16} = \frac{5}{8}$$

which implies a two digit inspection of $P_j$ to select $s_j$. Selection
requirements for other cases can be determined in a similar manner.
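
The residual bound (4.7) for other radices and redundancy coefficients follows by direct substitution. A minimal Python sketch, with the illustrative function name `residual_bound`, evaluates it exactly:

```python
from fractions import Fraction


def residual_bound(r, K):
    """Right-hand side of (4.7): K - 2(r - 1) / r**2."""
    return Fraction(K) - Fraction(2 * (r - 1), r * r)


# The two cases worked in the text, followed by two further cases.
for r, K in [(2, 1), (4, 1), (4, Fraction(2, 3)), (8, 1)]:
    print(r, K, residual_bound(r, K))
```

The first two lines of output correspond to the base 2 and base 4 values derived above (1/2 and 5/8).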

4.4  Hardware  Block  Diagram 

In  this  chapter  the  algorithms  ADD  and  SUBT  have  been  completely 
defined.   In  this  section  a  variable  block  diagram  implementing  the  ADD  and 
SUBT  algorithms  will  be  examined.   Here,  again,  only  the  major  components  of 
the  arithmetic  unit  (AU)  for  ADD  and  SUBT  will  be  discussed.   Lower  level 
details  for  actual  implementation  will  be  developed  in  Chapter  5. 

Figure 4.1 is a block diagram of the AU for performing addition and
subtraction. It is so structured as to be compatible with division and
multiplication. As with the other algorithms, the major component of the AU
is a full width multi-input limited carry-borrow propagation adder.

The sum (difference) digit selector is a table look-up device which
implements the SELECT function described for addition and subtraction (i.e.,
the same function as in multiplication). It examines the $\gamma$ most significant
digits of $P_j$, $\hat{P}_j$, and does essentially a rounding on it to select the proper
sum (difference) digit, $s_j$. As with the other algorithms, the rest of the
unit consists of registers and selectors.

4.5  Some Numerical Examples

Table 4.1 through Table 4.3 are examples of the algorithms ADD and
SUBT for several different radices and sum (difference) redundancy
coefficients (K).

Table 4.1  Example of Algorithm ADD
(r = 2, K = 1)

r = 2,  m = 8,  $D_\rho = \{\bar{1}, 0, 1\}$,  K = 1

A = 0.11011001
B = 0.10100101

(R = 1.01111110)

j    $(a_j + b_j) 2^{-1}$    $P_j$     $s_j$        $2(P_j - s_j)$
1    1.0                     1.0       1            2(1.0 - 1) = 0.0
2    0.1                     0.1       1            2(0.1 - 1) = -1.0
3    0.1                     -0.1      $\bar{1}$    2(-0.1 + 1) = 1.0
4    0.1                     1.1       1            2(1.1 - 1) = 1.0
5    0.1                     1.1       1            2(1.1 - 1) = 1.0
6    0.1                     1.1       1            2(1.1 - 1) = 1.0
7    0.0                     1.0       1            2(1.0 - 1) = 0.0
8    1.0                     1.0       1            2(1.0 - 1) = 0.0
9    0.0                     0.0       0            2(0.0 - 0) = 0.0

$$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 1.1\bar{1}111110 = 1.01111110$$

Table 4.2  Example of Algorithm ADD (r = 4, K = 1)

r = 4,  m = 4,  $D = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$,  K = 1

A = 0.11011001
B = 0.10100101

(R = 1.01111110)

 j    $(a_j + b_j)\,4^{-1}$    $P_j$     $s_j$        $4(P_j - s_j)$
 1          1.01              1.01      1          4(1.01 - 1)  =  1.0
 2           .11              1.11      2          4(1.11 - 2)  = -1.0
 3           .11             -0.01      0          4(-0.01 - 0) = -1.0
 4           .10             -0.10   $\bar{1}$     4(-0.10 + 1) = 10.0
 5          0.0              10.0       2          4(10.0 - 2)  =  0.0

$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 1.20\bar{1}2_4 = 1.01111110$
Table 4.3  Example of Algorithm SUBT (r = 4, K = 1)

r = 4,  m = 4,  $D_\rho = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$,  K = 1

A = 0.11011001
B = 0.10100101

(R = 0.00110100)

 j    $(a_j - b_j)\,4^{-1}$    $P_j$     $s_j$        $4(P_j - s_j)$
 1          0.01              0.01      0          4(0.01 - 0)  =  1.0
 2         -0.01              0.11      1          4(0.11 - 1)  = -1.0
 3          0.01             -0.11   $\bar{1}$     4(-0.11 + 1) =  1.0
 4          0.0               1.0       1          4(1.0 - 1)   =  0.0
 5          0.0               0.0       0          4(0.0 - 0)   =  0.0

$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 0.1\bar{1}10_4 = 0.010\bar{1}0100 = 0.00110100$
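The three examples above can be reproduced with a short simulation.  The
sketch below is an illustration under stated assumptions, not the SELECT
hardware itself: the names `select`, `online_add`, and `value` are
hypothetical, exact round-to-nearest (ties away from zero, clamped to the
largest digit $\rho$) stands in for the table look-up on the leading digits
of $P_j$, and the operand digits enter scaled by $r^{-1}$ as in the table
headings.

```python
from fractions import Fraction

def select(p, rho):
    """Assumed SELECT: round p to the nearest integer (ties away from
    zero) and clamp the magnitude to the largest digit rho."""
    mag = min(int(abs(p) + Fraction(1, 2)), rho)
    return mag if p >= 0 else -mag

def online_add(a_digits, b_digits, r, rho, subtract=False):
    """Left-to-right digit recursion as tabulated in Tables 4.1-4.3.
    Returns the m+1 signed result digits and the final residual."""
    sign = -1 if subtract else 1
    residual = Fraction(0)              # r(P_{j-1} - s_{j-1}), zero at start
    digits = []
    # One extra step with zero operand digits flushes the last residual.
    for a, b in zip(a_digits + [0], b_digits + [0]):
        p = residual + Fraction(a + sign * b, r)    # P_j
        s = select(p, rho)                          # s_j
        digits.append(s)
        residual = r * (p - s)
    return digits, residual

def value(digits, r):
    """Evaluate sum of s_i * r^(-i+1) for i = 1 .. m+1."""
    return sum(Fraction(s) * Fraction(r) ** (-i) for i, s in enumerate(digits))
```

With the radix 4 operand digits of Table 4.2, `online_add([3, 1, 2, 1],
[2, 2, 1, 1], r=4, rho=3)` returns the digits 1, 2, 0, -1, 2 with a zero
residual, and `value` of those digits equals 382/256, the exact sum A + B.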
5.   IMPLEMENTATION

The previous three chapters have dealt with the algorithmic design
for on-line division, multiplication, and addition/subtraction.  It is now
appropriate to consider the problems of actual implementation of the on-line
arithmetic unit while keeping in mind OBJECTIVES 4, 5, 6, and 7 as outlined
in Chapter 1.  With the typical computer scientist's fondness for acronyms,
this unit has been dubbed AURORA† (Arithmetic Unit Realizing On-line
Redundant Algorithms).  This chapter attacks the problems of implementation:
the design constraints of LSI, floating point considerations, design
modifications required to pipeline AURORA, minimal hardware requirements
with the corresponding speed tradeoffs, and a system-level overview.

5.1  Design  Constraints  of  LSI 

With the advent of large scale integration (LSI) technology has come
a challenge to computer designers to find circuits which make efficient use
of its full potential.  The ever improving reliability, cost, and size of
integrated circuits make it more and more reasonable to implement in
hardware various functions which previously belonged in the software domain.
It is this author's opinion that AURORA, in keeping with OBJECTIVE 4, makes
an excellent candidate for design as a single-chip LSI module.  When
designing the module, several properties innate to LSI technology [HOD76,
GOY76, LEW74] must be adhered to.  These properties include:

•  high circuit density,

•  regularity of structure,

•  low pin count, and

•  a large domain of applications.

† An aurora is appropriately defined as "the early period of anything."  In
Roman mythology Aurora was the goddess of the dawn.  Thus, the name seems
suitable in all respects: connotation, gender, and acronymability.

With  LSI,  thousands  of  individual  gates  can  be  fabricated  on  an 
extremely  small  chip  of  silicon.   There  has  been  a  problem  finding  suitable 
functions  which  require  a  large  number  of  gates  while  still  meeting  the 
other  design  constraints  of  LSI.   Semiconductor  memories  (RAMs,  ROMs,  and 
PLAs)  are  obviously  the  most  suitable  candidates  for  LSI  implementation. 
They  more  than  satisfy  the  previously  defined  properties.   Calculator 
chips  are  also  suitable  candidates.   The  microprocessor,  which  is  becoming 
more  and  more  widely  used,  can  be  made  suitable  for  implementation  in  LSI. 
Another  area  in  which  LSI  technology  is  apparent  is  in  digital  watch  chips. 
The  capability  provided  by  LSI,  though,  is  considerably  more  than  that 
required  for  the  typical  watch  function — telling  the  time  and  date. 
Consequently,  watches  which  are  also  timers,  counters,  alarms,  and  even 
calculators  are  appearing  on  the  market  in  order  to  make  full  use  of  the 
inherent  LSI  capabilities.   The  search  goes  on  for  other  candidates  which 
will  make  full  use  of  the  potential  of  LSI  while  meeting  the  predefined 
constraints.   AURORA,  similar  in  function  to  a  microprocessor,  will  have  a 
high  gate  count  (high  circuit  density).   Thus,  an  effort  must  be  made  to 
make  AURORA  conform,  as  the  microprocessor  has  been  made  to  conform,  to 
the  other  constraints  of  LSI. 
One of the most severe constraints in the design of an LSI device
is the restriction on interconnections within the chip itself.  This is due
to two factors:  1) the comparatively large chip area required for
interconnections reducing the "usable" chip area, and 2) the immense problem
of routing a large number of signals while maintaining a minimum number of
crossovers† and high gate density.  Interconnections can be simplified by
forcing the logic design of the chip into a cellular or regular structure.
A regular structure has other important implications.

1)  It  simplifies  the  manufacturing  process  by  making  tooling 
and  mask  production  easier  and  by  helping  to  regulate  the 
processing  steps. 

2)  It  makes  it  possible  to  optimize  each  cell  and,  thus,  the 
overall  chip  to  achieve  the  most  function  per  dollar.   A 
large  collection  of  random  gates  is  virtually  impossible 
to  optimize  because  of  the  large  number  of  variables 
involved. 

3)  The  generation  of  testing  algorithms  for  cellular  and 
regular,  repetitive  structures  is  easier  than  for  random 
logic. 

4)  In addition, as technological improvements pave the way
for larger devices, cellular structures are more easily
expanded to make use of this bonus real estate.

† Unavoidable crossovers call for more processing steps and, thus, lead to
higher tooling costs.
Just as microprocessors have been made to conform to the constraint
of regularity of structure, so can AURORA.  Typically, the most random
part of any system is its control logic.  It is envisioned that AURORA
would be microprogrammed, possibly via a PLA (Programmable Logic Array)
[RHY74, WES75], to avoid randomness in the control logic.

The  PLA  would  generate  all  necessary  control  signals  for  processing 
based  upon  the  opcode  input  to  the  unit  (+,  -,  *,  /),  the  present  state  of 
the  process,  internal  and  external  status  signals,  and  a  clock  pulse.   See 
Figure 5.1.  A PLA, which can be visualized as a logical AND network feeding
a logical OR network, works like a ROM with flexible addressing.  Though
suitable for LSI, using a ROM as a logic element has a major drawback in
that  a  word  must  be  stored  for  every  possible  combination  of  system  inputs, 
many  of  which  may  be  don't  cares  and  are  therefore  never  used.   The  PLA's 
address  translation  capability  allows  more  than  one  of  these  possible  input 
combinations  to  share  a  word  of  storage,  making  the  PLA  practical  in  many 
applications  too  large  for  the  typical  ROM.   The  PLA  also  has  more  input 
(address)  lines  than  a  comparable  ROM.   Using  a  PLA  instead  of  the  standard 
ROM  for  the  microcontrol  store  provides  for  self-addressing  of  the  next 
control  state  as  well  as  the  generation  of  the  required  control  signals 
within  a  single  array. 
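The contrast between a ROM and a PLA can be made concrete with a toy model.
The following is a minimal sketch, not the AURORA microcontrol store: the
product terms in `TERMS`, the four-bit input width, and the function names
are all hypothetical, chosen only to show how don't-care positions let
several input combinations share one stored term.

```python
def matches(pattern, bits):
    """AND plane: a product term fires when every non-don't-care
    position of the pattern equals the corresponding input bit."""
    return all(p in ("-", b) for p, b in zip(pattern, bits))

def pla_eval(terms, bits, n_outputs):
    """OR plane: each output is the OR of the product terms wired to it."""
    out = [0] * n_outputs
    for pattern, outputs in terms:
        if matches(pattern, bits):
            for k in outputs:
                out[k] = 1
    return out

# Hypothetical control table: 4 input lines, 3 output lines.
# "-" marks a don't-care input position.
TERMS = [
    ("00--", (0,)),
    ("01-1", (1,)),
    ("1---", (0, 2)),
]

N_INPUTS = 4
rom_words = 2 ** N_INPUTS      # a ROM stores one word per input combination
pla_terms = len(TERMS)         # the PLA stores one row per product term
```

In this toy case the ROM would need 16 stored words while the PLA needs only
3 product terms, since the term `"1---"` alone covers eight input
combinations.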

The  processing  logic  of  AURORA  will  require  the  use  of  proper  logic 
partitioning  to  make  it  suitable  to  LSI.   Logic  partitioning  involves  the 
organization  of  the  internal  logic  structure  so  that  large  functional  areas 
(or  arrays)  on  the  chip  can  be  grouped  together  and  used  repetitively. 
External  to  the  chip,  functional  partitioning  of  the  overall  system  requires 
a  framework  consisting  of  modules  which  are  completely  self-contained 
processors, each having its own local store, processing logic, and the
control  necessary  for  the  module  to  execute  its  function.   Thus,  each  module 
acts  as  a  small  insular  unit  of  logic.   Since  each  module's  control  sees 
only  its  own  state,  the  internal  and  external  communication  requirements  are 
correspondingly  reduced.   The  AURORA  as  a  module  of  a  larger  system  meets 
this  criterion.   Internal  to  the  chip,  an  effort  must  be  made  to  subdivide 
the  logical  units  of  the  processing  logic  in  order  to  achieve  logic 
partitioning  and  its  benefits. 

One of the most serious constraints imposed by LSI on any device is
that of a limited number of external connections (i.e., a low pin count).
A realistic maximum of 64 external connections (pins) is an unavoidable LSI
limitation.  AURORA should require fewer pins than this upper limit.  The
unit will be designed so that it can be connected easily to the bus of a
typical microprocessor.  A bus system based upon generic signals, such as
MUMS (Modular-Unified-Microprocessor-System) developed by Faiman [FAI77,
CAT76], could be used as a model for designing the communication interface
of AURORA.  This would help to ensure generality of the interface with a
system.  The number of signals on the MUMS bus sets an upper limit on the
pin count for communication with the mother system at 47:  a maximum of 32
for data (16 "data" lines and 16 "address" lines) and 15 for control (3 power
lines, 6 memory control lines, and 6 interrupt control lines).  The upper
limit on the number of bits per operand word input at the beginning of each
cycle is, then, 16.  Thus, the choice of radices to be used in the internal
hardware for implementation of the algorithms can be realistically set at
either 2, 4, 16, or 256.  The higher the radix, the higher the speed and the
higher the complexity (gate count).  A rough overall pin count for AURORA
is attempted in Section 5.4.  This count must also include requirements for
communication with an exponent unit for floating point operations and,
possibly, communication with other AURORA modules.

By designing AURORA to meet the specifications of a typical
microprocessor bus system, the domain of usefulness increases dramatically.
The desirability of having a peripheral unit on a microprocessor bus capable
of achieving on-line variable precision arithmetic is obvious.  This and
other applications of AURORA are discussed in Chapter 6.

5.2  Floating  Point  Considerations 

In early, fixed point computers, all numerical data within the
computer was scaled to lie within a restricted range; a frequent choice was
that of a fractional representation such that each number, X, used in
computation was in the range -1 < X < 1.  It soon became obvious that in
order to reduce the burden of scaling during preparation or execution of a
problem, a second arithmetic unit, for operations on exponents, should be
introduced.  Then a number X could be represented by a fractional part f and
an exponent e, such that

$$X = f \cdot r^{e}$$

where e is a positive or negative integer and f is a fraction in the
normalized range

$$\tfrac{1}{2} \le |f| < 1 .$$

In order for AURORA to process floating point numbers, two arithmetic
units are required, one for the fractional parts and one for the exponents.
Data would be handled between the mother system and these two processing
units in the following sequence:  the exponents, which are assumed to be one
memory word each, would be sent to the exponent unit; the exponent unit
would then handle these exponent operands appropriately, sending the proper
shift signals to the mantissa unit; and, while these exponents were still
being processed, the fractional operands, consisting of several memory words
each, would be sent in an on-line fashion to the fractional unit.  Upon
normalization and exponent adjustment of the results via an investigation of
the most significant digits of the mantissa as soon as they became
available, the exponent of the result could then be returned to the mother
system, followed by an on-line return of the result words as they are
generated.  A sample mother system organization is given in Section 5.5.

If  the  operation  in  question  was  addition  or  subtraction,  an  exchange 
of  communication  such  as  the  following  must  occur  between  the  two  units. 

1)  The  exponent  unit  would  determine  which  operand  was 
smaller. 

2)  The  mantissa  unit,  upon  signals  from  the  exponent  unit, 
would  shift  the  appropriate  operand  to  the  right  while 
the  exponent  unit  would  increase  that  exponent 
accordingly,  until  the  two  exponents  agree. 

3)  The  mantissa  unit  would  start  the  on-line  addition  or 
subtraction. 

4)  When  the  most  significant  digits  of  the  result  were 
produced  in  the  mantissa  unit,  the  result  could  be 
normalized  (shifted  right  if  necessary)  and  the  mantissa 
unit  would  signal  the  exponent  unit  to  adjust  the 
result  exponent  (i.e.,  the  exponent  of  the  larger 
operand)  appropriately. 
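The four steps above can be sketched in a few lines.  This is only an
illustrative model of the exchange between the two units: the function name
`fp_add` is hypothetical, exact rational mantissas replace the redundant
digit streams, the mantissa addition is done in one step rather than
on-line, and only the overflow case of normalization (one right shift) is
modeled.

```python
from fractions import Fraction

def fp_add(f_a, e_a, f_b, e_b, r=2):
    """Add f_a * r**e_a and f_b * r**e_b, returning (fraction, exponent)."""
    # Step 1: the exponent unit determines which operand is smaller.
    if e_a < e_b:
        f_a, e_a, f_b, e_b = f_b, e_b, f_a, e_a
    # Step 2: shift the smaller operand right one digit per unit of
    # exponent difference, until the two exponents agree.
    shift = e_a - e_b
    f_b = f_b / Fraction(r) ** shift
    # Step 3: mantissa addition (performed on-line in AURORA).
    f = f_a + f_b
    e = e_a                      # exponent of the larger operand
    # Step 4: normalize with one right shift if the sum overflowed,
    # and adjust the result exponent to match.
    if abs(f) >= 1:
        f = f / r
        e += 1
    return f, e
```

For instance, `fp_add(Fraction(3, 4), 1, Fraction(3, 4), 1)` overflows to
3/2 in step 3 and is returned as the fraction 3/4 with exponent 2.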


Multiplication is only slightly more complicated in that the sum
of the exponents must be formed in the exponent unit to produce the exponent
of the result.  One step of normalization in the form of one left shift of the
result and a unit decrease of the result exponent may be necessary if the
fractional product, R, falls in the range

$$\left(\tfrac{1}{2}\right)^{2} \le |R| < \tfrac{1}{2} .$$

Similarly, for division the difference of the exponents must be
formed by the exponent unit to produce the result exponent.  The fractional
quotient, R, if it falls in the range

$$1 \le |R| < 2 ,$$

may then require normalization by one right shift with a unit increase in the
result exponent.

A communications protocol between the two units, exponent and mantissa,
is shown in Figure 5.2.  Recall that operand scaling and the corresponding
result correction to ensure convergence of the algorithms are internal to the
mantissa processing unit.

5.3  Algorithmic  Modifications  of  AURORA  for  Pipelining 

Up to this juncture AURORA has been discussed from the point of view
of the serial processing of multiple memory words on a single unit which
accepts one memory word per cycle.  A problem arises, then, if the memory
width is wider than the operand width accepted by one AURORA during one cycle.
Perhaps this problem can be resolved by connecting multiple AURORAs together,
each accepting one "byte" of the memory word, with all of the units running
in parallel to produce multiple result "bytes" in parallel constituting one
result word.  (Note that the data input to an AURORA has been redefined from
"word" to "byte" in this context.)  If so, would it then be possible to
combine the two approaches (serial and parallel), producing a serial system
of, say, N AURORA units which run in parallel?  The entire system, while
cycling serially, would then produce N result bytes during each cycle.

While this is an admirable goal, on second consideration it is
deemed improbable that such a system of simple AURORAs could be designed.
Since AURORA's operations include division and multiplication, such a
design specification, allowing an unspecified number of parallel units
to be connected to produce parallel results, implies parallel division and
multiplication!  Such a conclusion is intuitively impossible.

There are two more or less attractive alternative solutions to this
problem.  The first approach is based upon the assumption that the demand
that parallel units produce parallel results is removed.  Then the units,
connected as shown in Figure 5.3, though capable of being loaded in
parallel, would produce result bytes in a serial fashion rippling down
from the most significant AURORA.  The necessary information would be
pipelined from left to right through the units.  The information to be
pipelined through AURORAs arranged in this manner must be determined.  Such
an arrangement, of course, will increase the complexity of the unit, while
the communication required between units will increase the pin count of
each unit.  But the speed benefits of streaming multiple instructions
through the units, as allowed in pipelining [RAM77], would more than
compensate for the added complexity costs.

Another technique for accommodating the parallel input of oversized
operand words is to use one AURORA unit serially as in Figure 5.4, while
providing large holding shift registers for the oversized operands.  While
this approach requires the purchase of only one expensive AURORA unit, it
must be dedicated to a single instruction until that instruction is
completed.  Thus, the speed benefits of pipelining are unavailable.

The necessary information to be pipelined through the multiple
AURORA setup will now be specified.  The basic recursion of each operation
must be investigated to determine this information.  For addition and
subtraction it is obvious from the basic recursion, (4.1),

$$P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j + b_j$$

that the information to be passed down the pipe from one stage to the
next is just the "residual"

$$r(P_{j-1} - s_{j-1})$$

which increases in significant width as it flows from one unit to the next
down the pipe.  The problem becomes slightly more complicated when
multiplication is attempted.  The basic recursion, (3.4),

$$P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$$

indicates that not only the residual, but also the operand digits† up to
and including $x_j$ and $y_j$ must be passed.  It is only in this way that the
next stage can then form $X_j$ and $Y_{j-1}$.  The information to be passed in
division is not at all obvious.  The basic recursion, (2.3),

$$P_j \leftarrow rP_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}$$

† For simplicity during this discussion, it is assumed that one byte of an
operand or result word is equivalent to one digit of that operand or
result word.
must be broken down further into


in order to reveal the desired information.  Since

$$R_j d_{j+\delta} = \Bigl(\sum_{i=1}^{j-1} q_i r^{-i}\Bigr) d_{j+\delta} + q_j r^{-j} d_{j+\delta} ,$$

equation (5.1) can be rewritten as

$$P_j \leftarrow rP_{j-1} - q_j D_{j-1} + n_{j+\delta} r^{-\delta} - R_j d_{j+\delta} r^{-\delta} . \qquad (5.2)$$

By using equation (5.2) as the basic recursion for division, the
information to be passed down the pipe is seen to be the division residual,

$$rP_{j-1} - q_j D_{j-1} ,$$

the divisor digits up to and including $d_{j+\delta}$ (to be used in forming
$D_{j-1}$ in the next stage), and all of the quotient digits up to and
including $q_{j+1}$ (to be used in forming $R_j$ in the next stage).

The necessary information to be sent down the pipe over all
operations is, then, a residual, both operands, and the result.  By sending
a digit of each operand and result at the end of every cycle, by the time
the last stage of the pipe is processing an operation all of the necessary
information has been sent to it.  The residual can be passed either as its
individual elements, $P_{j-1}$ and $p_{j-1}$ for multiplication, or the
residual can be explicitly formed in the unit and passed as one unique
entity.  Of course, this would require another adder (or another pass
through the main adder) in each AURORA to form the residual.  The addition
of this residual adder simplifies the design of the main adder.  Section
5.4.2 investigates the hardware adjustments of the processing logic
necessary to pipeline the AURORA.

Some obvious hardware complications besides the extra residual
adder in the processing logic are 1) more pins for the communication
requirements between the pipelined AURORAs, 2) a much more complex
microcontrol store, and 3) a bank of extra registers to store the operand
and result digits as they arrive.
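The residual hand-off between pipelined units can be illustrated for the
addition case.  The sketch below is a simplified model under stated
assumptions: `stage` and `pipeline_add` are hypothetical names, each stage
handles exactly one digit (one byte equals one digit, as assumed in this
section), exact round-to-nearest clamped to $\rho$ stands in for the SELECT
table, and operand digits are scaled by $r^{-1}$ as in the Chapter 4
examples.

```python
from fractions import Fraction

def stage(residual_in, a, b, r, rho):
    """One unit of the pipe: take the residual from the previous unit
    and one digit of each operand; emit a result digit and a residual."""
    p = residual_in + Fraction(a + b, r)            # P_j
    mag = min(int(abs(p) + Fraction(1, 2)), rho)    # round, clamp to rho
    s = mag if p >= 0 else -mag                     # s_j
    return s, r * (p - s)                           # r(P_j - s_j)

def pipeline_add(a_digits, b_digits, r, rho):
    """Chain the stages from most to least significant; the only value
    crossing a stage boundary is the residual."""
    residual = Fraction(0)
    result = []
    # One trailing stage with zero operand digits flushes the residual.
    for a, b in zip(a_digits + [0], b_digits + [0]):
        s, residual = stage(residual, a, b, r, rho)
        result.append(s)
    return result
```

Result digits emerge one stage at a time from the most significant unit,
which matches the serial rippling of result bytes described for the
arrangement of Figure 5.3.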

5.4  AURORA  Hardware  Requirements 

A  personalized  hardware  block  diagram  has  been  shown  for  each 
operation  in  its  appropriate  chapter.   It  now  remains  to  combine  the  various 
requirements  into  one  total  arithmetic  unit  which  can  be  microprogrammed 
to  provide  the  various  functions.   Low-level  (gate)  detail  of  an  actual 
implementation  will  not  be  discussed.   The  description  given  here  will  be 
only  detailed  enough  to  estimate  the  following  features:   complexity, 
speed,  and  pin  count. 

5.4.1  The  Processing  Logic 

The block diagram of the processing logic of AURORA is shown in
Figure 5.5.  For clarity, the basic recursion formulas are now restated.

DIVISION (R = N/D)
    $P_j \leftarrow rP_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}$

MULTIPLICATION (R = X·Y)
    $P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$

ADDITION (R = A+B)
    $P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j + b_j$

SUBTRACTION (R = A-B)
    $P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j - b_j$

The processing unit is made up of an adder, a result digit selector,
registers, and selection networks.  A brief discussion of the individual
elements will now be given.

Two full width double bank registers are required for the storage
of the result, R, and the partial remainder, $P_j$, because they are in
redundant form.  Two full width single bank registers will suffice for
the storage of the operands:  one of N, X, or A and one of D, Y, or B.
The selection networks must be capable of forming the required multiples
of their inputs.  In the case of radix complement representation, the
selection network must also be able to form the complement of these possible
multiples.  The complexity of the selection networks increases for higher
radices and, since the additional multiples appear as inputs to the
adder, the complexity of the adder would also be increased.  A higher
radix, while reducing the number of cycles per operand word and thereby
increasing the processing speed, does increase the complexity of the
processing unit.  For this reason and because of the complexity of the
result digit selection, radix 256 should be eliminated from further
serious consideration.  The choice among the other possible radices (2, 4,
and 16) depends upon speed requirements and the amount of hardware
complexity allowed.

The result digit selector is a table look-up device (ROM) which
implements the SELECT functions of the appropriate algorithm.  Recall that
the SELECT functions of multiplication, addition, and subtraction are
identical:  a simple rounding is performed on the $\gamma$ most significant
digits of $P_j$, the lower limit for proper selection.  Table 5.1 gives a list
of the possible values of $\alpha$, $\beta$, and $\gamma$.  The first $\alpha$
most significant digits of $P_j$ and the first $\beta$ most significant digits
of D are inspected during division.

Table 5.1  Result Digit Selector Inputs

                        INDEX DIFF      REDUNDANT $P_j$            D
RADIX    $\rho$(K)      ($\delta$)    $\alpha$(bits)  $\gamma$(bits)  $\beta$(bits)     TOTAL
  2      1(1)           5,6,7,8        4(8)       2(4)        0          8 + 1 =  9
  4      2(2/3)         4              4(16)      2(8)        3(6)      22 + 1 = 23
                        5,6,7,8        3(12)      2(8)        3(6)      18 + 1 = 19
         3(1)           3              3(12)      2(8)        3(6)      18 + 1 = 19
                        4,5,6          3(12)      2(8)        2(4)      16 + 1 = 17
 16*     11(11/15)      4              2(16)      2(8)        2(8)      24 + 1 = 25
         12(4/5)        3,4            2(16)      2(8)        2(8)      24 + 1 = 25
         13(13/15)      3,4            2(16)      2(8)        2(8)      24 + 1 = 25
         14(14/15)      3,4            2(16)      2(8)        2(8)      24 + 1 = 25
         15(1)          3,4            2(16)      2(8)        2(8)      24 + 1 = 25

*Cases with more than 50 steps in any single overlap region were
automatically eliminated from consideration.

The complexity of base 16 for the result digit selection is obvious.  The
binary case would require a table look-up device with 9 available address
lines:  8 for the first 4 redundant bits of $P_j$ and 1 to indicate which
operation is in force.  The particular case for radix 4, $\rho = 3$, and
$\delta = 3$ would require 19 available lines:  12 for the first 3
redundant digits of $P_j$, 6 for the first 3 conventional digits of D, and 1
for control.

For the radix 4 case, the number of address lines for the table
look-up is somewhat large by conventional standards.  Two techniques to
avoid this dilemma are available:  1) use a PLA with the previously
outlined advantages to do the selection, or 2) perform a carry propagation
on the most significant portion of $P_j$ to reduce the number of lines
required.  Table 5.2 indicates the number of address lines required when
a carry propagation is employed on the most significant portion of $P_j$.
Note that division always requires more significant digits of $P_j$ for
selection than the other operations.  By using all of the available
digits for selection during all operations, which is more than the
absolute minimum required for multiplication, addition, and subtraction,
the bounds on the operands during addition, subtraction, and multiplication
can be relaxed significantly as discussed in Chapter 3, Section 3.4.
The comparison constants for division which must be encoded into the
table look-up for quotient digit selection are given in Appendix II
for several cases.  The comparison constants for multiplication, addition,
and subtraction were given in Chapter 3, Section 3.3, as
$\pm\tfrac{1}{2}, \pm 1\tfrac{1}{2}, \pm 2\tfrac{1}{2}, \ldots, \pm(\rho - \tfrac{1}{2})$.
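The address-line totals quoted for the result digit selector follow a simple
pattern that can be checked mechanically.  The sketch below is an
illustration, with the hypothetical function name `address_lines`; it assumes
that a redundant digit occupies twice as many bits as a conventional digit
of the same radix, that carry propagation reduces the inspected digits of
$P_j$ to conventional width, and that one extra line indicates the operation.

```python
def address_lines(bits_per_digit, alpha, beta, carry_propagated=False):
    """Address lines for the selector table.

    bits_per_digit   -- log2 of the radix (1, 2, or 4)
    alpha            -- inspected most significant digits of P_j
    beta             -- inspected most significant digits of D
    carry_propagated -- True if P_j was carry-propagated first
    """
    width = 1 if carry_propagated else 2     # redundant digits double the bits
    p_bits = alpha * bits_per_digit * width
    d_bits = beta * bits_per_digit           # D is in conventional form
    return p_bits + d_bits + 1               # +1 line for control

# Radix 4 with 3 digits of P_j and 3 digits of D:
lines_redundant = address_lines(2, 3, 3)                          # 12 + 6 + 1
lines_propagated = address_lines(2, 3, 3, carry_propagated=True)  #  6 + 6 + 1
table_words = 2 ** lines_redundant        # entries in a full ROM look-up
```

A full ROM look-up needs $2^{n}$ words for $n$ address lines, which is why a
reduction from 19 to 13 lines matters: the table shrinks by a factor of 64.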
Table 5.2  Result Digit Selector Inputs After Carry Propagation

                        INDEX DIFF    CARRY PROPAGATED $P_j$        D
RADIX    $\rho$(K)      ($\delta$)    $\alpha$(bits)  $\gamma$(bits*)  $\beta$(bits)    TOTAL
  2      1(1)           5,6,7,8        4(4)       2(2)        0          4 + 1 =  5
  4      2(2/3)         4              4(8)       2(4)        3(6)      14 + 1 = 15
                        5,6,7,8        3(6)       2(4)        3(6)      12 + 1 = 13
         3(1)           3              3(6)       2(3)        3(6)      12 + 1 = 13
                        4,5,6          3(6)       2(3)        2(4)      10 + 1 = 11
 16      11(11/15)      4              2(8)       2(7)        2(8)      16 + 1 = 17
         12(4/5)        3,4            2(8)       2(6)        2(8)      16 + 1 = 17
         13(13/15)      3,4            2(8)       2(6)        2(8)      16 + 1 = 17
         14(14/15)      3,4            2(8)       2(5)        2(8)      16 + 1 = 17
         15(1)          3,4            2(8)       2(5)        2(8)      16 + 1 = 17

*From Table 3.1
The central part of the processing logic, the adder, is also
the most complex part.  This adder, as has been repeatedly stated, is a
multi-input limited carry-borrow propagation type adder.  Such adders
have been the subject of intense study in much of the literature [GOY76,
ROH67, BOR68, ROB75, AVI61, MET57].  As a result it will not be examined
in detail here.  Figure 5.6 gives the adder I/O requirements for the
hardware block diagram being discussed, while Figure 5.7 represents the
$i$th digit position of this adder.  A carry generator will be needed if a
radix complement representation for negative numbers is used.

5.4.2  Hardware  Modifications  of  AURORA  for  Pipelining 

If the pipelining approach described in Section 5.3 is used (stringing
multiple AURORAs together), an adjustment of the processing logic makes it
suitable for this technique.  An extra adder can be used to form the
residual, as discussed in Section 5.3, which must be passed from one unit
to another.  This residual is in the form of

$$r(P_{j-1} - p_{j-1})$$

for multiplication, addition, and subtraction (with $s_{j-1}$ in place of
$p_{j-1}$ for addition and subtraction), and is in the form of

$$rP_{j-1} - q_j D_{j-1}$$

for division.  Additionally, the two operand digits, $x_j$ and $y_j$, must be
passed in multiplication, and the divisor and quotient digits passed in
division during each cycle.

The hardware block diagram for this approach is shown in Figure 5.8.
Note that the addition of the second adder has simplified the input
requirements of the main adder.  The control for this setup would be
more complex than in the serial arrangement.
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5.4.3  Speed  of  the  Processing  Logic 

An attempt to bound the speed of the serial processing logic is
now appropriate. Define the time to perform one basic recursion in the
standard processing logic (Figure 5.5) as

$$t_R = t_T + t_A + t_S$$

where $t_T$ is the register and selection network transfer time, $t_A$ is the
redundant add time, and $t_S$ is the result digit selection time. Both $t_T$
and $t_S$ correspond to a few gate delays, 4 to 5 $t_g$. So $t_A$ appears to be
the dominant factor in the equation and depends upon the radix (and, thus,
the number of inputs to the adder) and the adder structure. Let s be
defined to be the number of radix 2 summands (inputs to the adder); i.e.,
higher radix and redundant inputs are considered in their binary equivalent
formats. Then a simple adder structure, consisting of (s-2) levels
of full adder rows [GOY76], will have

$$t_A = 2(s-2)\,t_g ,$$

assuming $2t_g$ per full adder. More sophisticated adder structures, such as
Dadda-types [HO73], can considerably reduce this time. Using this formula
for $t_A$, the total recursion time for a radix 2 case, where the worst case
s over all operations is 5 (i.e., 1 conventional input and 2 redundant
inputs), is

$$t_R = O(14\,t_g)$$

In the case of radix 4, this increases to $O(26\,t_g)$.
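The recursion-time bound can be checked numerically. The following is a minimal sketch, assuming that the transfer and selection times are each taken as 4 gate delays (the low end of the "4 to 5" range quoted above); the function name and its parameters are illustrative and do not come from the thesis. The value s = 11 used for radix 4 is back-computed from the quoted 26 gate delays and is likewise an assumption.

```python
def recursion_time(s, t_transfer=4, t_select=4):
    """Gate delays for one recursion, t_R = t_T + t_A + t_S.

    s          -- number of radix-2 summands entering the adder
    t_transfer -- assumed register/selection transfer time (gate delays)
    t_select   -- assumed result digit selection time (gate delays)
    """
    if s < 2:
        raise ValueError("an adder needs at least two summands")
    t_add = 2 * (s - 2)  # (s-2) rows of full adders, 2 gate delays per row
    return t_transfer + t_add + t_select


# Radix 2: worst case s = 5 (1 conventional input and 2 redundant inputs).
print(recursion_time(5))   # 14
# Radix 4: s = 11 is an inferred value that reproduces the quoted 26.
print(recursion_time(11))  # 26
```

With these assumptions the add time contributes 6 of the 14 gate delays in radix 2 but 18 of the 26 in radix 4, which illustrates why the adder structure dominates as the radix grows.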

Depending on the operand size input, a number of cycles must occur
before a result is completely generated, as tabulated in Table 5.3. While
Table 5.3  Word Processing Time Delay

WORD              BITS/    DIGITS/WORD      TIME/      TOTAL WORD
LENGTH   RADIX    DIGIT    (# OF CYCLES)    CYCLE      PROCESSING TIME
   4        2       1            4          14 t_g       56 t_g  (*)
   4        4       2            2          26 t_g       52 t_g  (*)
   4       16       4            1          98 t_g       98 t_g  (*)
   8        2       1            8          14 t_g      112 t_g
   8        4       2            4          26 t_g      104 t_g
   8       16       4            2          98 t_g      196 t_g
  16        2       1           16          14 t_g      224 t_g
  16        4       2            8          26 t_g      208 t_g
  16       16       4            4          98 t_g      392 t_g

(*) These cases do not satisfy the on-line delay requirements
of division.
using a higher radix can speed up the overall processing time ($112\,t_g$ for
base 2 versus $104\,t_g$ for base 4 given an 8 bit input width), it also
increases the complexity of the adder. From this table it would appear
that radix 4 may be the best compromise. Table 5.5 reinforces this choice
in comparing speed and complexity of the processing logic. Once again the
trade-off between high speed and low gate count is evident.
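The entries of Table 5.3 follow from multiplying the number of digits per word by the time per cycle. The sketch below reproduces that arithmetic; the per-cycle times are copied from the table, while the function and dictionary names are hypothetical and chosen only for illustration.

```python
# Time per cycle in gate delays, as listed in Table 5.3.
CYCLE_TIME = {2: 14, 4: 26, 16: 98}


def word_processing_time(word_bits, radix):
    """Total gate delays to process one word of `word_bits` bits."""
    bits_per_digit = radix.bit_length() - 1   # log2(radix) for radix 2, 4, 16
    if word_bits % bits_per_digit != 0:
        raise ValueError("word length must be a whole number of digits")
    cycles = word_bits // bits_per_digit      # one cycle per digit
    return cycles * CYCLE_TIME[radix]


for bits in (4, 8, 16):
    row = {radix: word_processing_time(bits, radix) for radix in (2, 4, 16)}
    print(bits, row)
```

For an 8 bit word this gives 112, 104, and 196 gate delays for radix 2, 4, and 16 respectively, so the halving of the cycle count in radix 4 slightly outweighs its longer cycle, while radix 16 does not recover the cost of its adder.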

5.4.4  AURORA  as  a  Total  Module 

A  block  diagram  of  the  total  unit  is  given  in  Figure  5.9.   It 
includes  the  PLA  control  logic  as  discussed  in  Section  5.1,  the  processing 
logic  just  discussed,  the  recoding  logic  as  discussed  in  Appendix  III, 
and  the  I/O  requirements  of  the  overall  module.   An  approximate  pin  count 
for  base  2,  4,  and  16  is  given  in  Table  5.4. 

Table 5.4  Rough Pin Count for AURORA

          INPUT        OUTPUT      PINS TO
WORD      # OPERAND    # RESULT    COMMUNICATE
LENGTH    PINS         PINS        WITH EXP. UNIT    CONTROL    TOTAL
   4          8            4             4              15        31
   8         16            8             4              15        43
  16         32           16             4              15        67

Table 5.5  Speed vs. Complexity
(Processing Logic)

[Table body not recoverable from the source. Surviving headings: "WORST CASE WORD PROCESS TIME" and "(LEVELS)". Surviving entries: "3 levels", "112 t_g", "208 t_g", "15 LINES/19 WORDS".]

$t_g$ is one gate delay
Figure 5.9  Block Diagram of an AURORA
5.5  System's Level Overview

AURORA  was  designed  to  operate  as  a  functional  module  controlled 
externally  by  a  larger  system.   This  system  must  then  provide  AURORA  with 
the  correct  operand  words  and  control  signals  and  receive  from  AURORA  the 
result  word  for  storage.   This  system  must  fetch  the  correct  number  of 
operand  words  in  the  proper  order  from  main  memory,  transfer  them  on-line 
to  AURORA,  and  store  the  result  words  in  the  proper  order  back  into  main 
memory.   To  avoid  the  necessity  of  dedicating  the  system  to  AURORA  while 
the  digits  are  being  processed,  the  AURORA  unit  should  signal  the  main 
system  via  an  interrupt  when  a  result  word  is  ready  for  storage.   Then  the 
system  is  free  to  handle  other  processes  while  AURORA  is  functioning. 
Figure  5.10  shows  a  sample  system  organization. 

This  chapter  has  consisted  of  an  overview  only  of  the  necessary 
considerations  of  actual  implementation.   The  implementation  as  such  was 
not  attempted,  but  is  left  for  a  more  suitable  medium.   A  detailed  gate 
count  was  considered  inappropriate  and  ill  advised.   It  remains  to  justify 
the  existence  of  such  a  unit  as  AURORA. 
6.  SUMMARY

6.1   Summary  of  the  Results 

The algorithmic and logical design of an arithmetic unit to be
used in a computational environment in which the basic arithmetic
operations satisfy the on-line property has been presented. The on-line
property requires that to generate the j-th digit of a result (where a
digit consists of n bits for base $2^n$), it is necessary and sufficient to
have the operands available only up to the j-th digit plus, in the case
of division, a predetermined number of extra digits which correspond to
an "on-line delay." Since there is no on-line delay for addition,
subtraction, and multiplication, the algorithms can begin generating
result digits as soon as one digit of each operand has been input. The
delay for division was shown to be a small, positive, radix dependent
constant. To fulfill the on-line requirements, a set of left-to-right
(most-to-least significant), digit-by-digit algorithms has been derived.
The existence of such algorithms is contingent upon using a redundant
representation for the result digits. These algorithms and a block diagram
implementation of the basic arithmetic unit have been presented.

Algorithms  for  addition  and  subtraction  which  conform  to  this 
on-line  property  were  easily  specified,  while  the  multiplication  algorithm 
required  a  somewhat  more  elaborate  approach.   The  existence  of  an  on-line 
division  algorithm  was  only  recently  discovered  [TRI75];  the  bulk  of 
this  thesis  was  a  development  and  extension  of  this  early  work.   Quotient 
digit selection procedures based upon a limited precision model of the operands
and minimum values for the on-line delay were discussed in detail. Once
compatible algorithms for the basic operations were defined, the problems
of actual implementation of the basic unit, AURORA, were then considered.
Suitability to LSI, floating point processing, and adjustments in the
hardware to gain speed were some of the desirable design goals.

It  is  now  fitting  to  discuss  some  possible  applications  of  such  a 
unit  to  justify  its  existence. 

6.2  Applications 

It  has  been  stated  that  a  unit  suitable  for  investigation  as  an 
LSI  chip  should  have  as  large  a  domain  as  possible.   Several  application 
areas  are  ripe  for  exploitation  today,  while  others,  as  advances  are  made 
in  technology,  are  on  the  horizon.   The  following  list  of  applications  is, 
by  no  means,  exhaustive,  but  is  presented  here  as  a  representative  study. 
1)   The  most  obvious  use  for  AURORA  is  in  the  area  of 

real-time  applications.   As  the  operands  are  generated 
serially  by  an  analog-to-digital  conversion  process 
beginning  with  the  most  significant  digits,  AURORA 
could  be  used  to  process  these  operand  digits  as  soon 
as  they  became  available  from  the  converter.   This 
is  unlike  the  conventional  setup,  where  the  processing 
unit  must  wait  while  the  full  precision  operands  are 
converted  before  starting  operation.   The  speed  up 
benefits  are  obvious.   Such  a  system,  with  the 
capability  of  overlapping  conversion  and  computation, 
has  a  definite  place  in  long  distance  communications 
and satellite systems. In fact, any system designed
to  be  of  use  in  a  real-time  environment  could  make 
significant  gains  with  the  addition  of  an  AURORA 
module  to  its  hardware. 
2)   Another  possible  application  is  in  performing  variable 
precision  arithmetic.   The  described  algorithms  and 
simple  implementation  requirements  of  AURORA  are 
compatible  with  the  required  modularity  of  any 
variable  precision  unit.   Both  hardware  implementations 
as  discussed  in  Chapter  5,  serial  processing  on  one 
unit  or  the  pipelining  of  multiple  AURORAS  for  speed, 
provide variable precision arithmetic in a straightforward
manner. Of course, the ultimate allowable
precision  is  set  by  the  internal  hardware  of  AURORA 
as  with  any  such  device.   It  is  believed  that  sufficient 
register  and  adder  widths  can  be  provided  by  large 
scale  integrated  technology  to  provide  enough  "variable 
precision  arithmetic"  to  meet  the  demands  of  most 
applications.   The  desirability  of  being  able  to  attach 
an  AURORA  as  a  peripheral  device  on  a  microprocessor 
bus is obvious. The microprocessor, a device traditionally
restricted from most mathematical applications
because  of  its  short  word  length,  could  not  help  but 
benefit  from  this  boost  in  processing  power.   As  an 
added  bonus,  since  AURORA  signals  completion  via  an 
interrupt,  the  microprocessor  would  not  have  to  be 
dedicated to it. While the AURORA is functioning, the
main  system  would  be  freed  to  handle  other  pending 
processes. 

3)   If  the  approach  of  connecting  multiple  AURORAS  together 
is  used  for  actual  implementation,  pipelining  of 
successive  operations  could  be  used  as  an  effective 
speed up technique. Recall that the recursive functioning
ripples down from the most-to-least significant
unit.   When  the  first  unit  completes  processing  and 
returns  the  result  digit  to  the  main  system,  it  has 
also  passed  all  necessary  information  for  that  operation 
to  correctly  continue  on  down  the  pipe  to  the  next  unit. 
When  one  unit  has  completed  all  of  the  processing 
associated  with  the  present  operation,  the  next  unit  in 
line  can  begin  generating  the  next  result  digit 
associated  with  that  same  instruction.   Thus,  the  one  unit 
is  free  to  initiate  processing  on  the  next  instruction 
in  the  program.   In  this  way,  the  fraction  arithmetic 
unit,  which  has  been  traditionally  considered  as  a 
single  stage  of  the  pipeline  [STE75,  AND67],  can  be 
further  decomposed  into  multiple  stages  to  speed  up 
processing  even  more.   Chaining  operations  on  result 
digits  as  they  are  generated  can  increase  processing 
speed  even  further. 

4)   One application on the horizon will become more feasible
as technological improvements make way for the widespread
use of large serial memories (CCDs, bubble
memories, etc.). The major user of the large serial
memories  will  be  data  base  systems.   AURORA  could  be 
used  in  conjunction  with  these  serial  memories  to  provide 
instant  processing  capabilities  for  data  base  systems. 
As  soon  as  the  most  significant  word  was  (serially) 
extracted  from  the  memory,  the  decisions  as  to  which 
actual  memory  word  is  desired  could  be  made  on-line. 
Then,  just  that  operand  would  have  to  be  read  from 
memory.   The  only  other  alternative  is  to  read  both 
operands from memory, a lengthy process, and then proceed
to compare them to determine which is the desired
operand.   By  using  AURORAS  to  provide  on-line  processing, 
intelligent  data  base  retrieval  systems  could  be  built. 

6.3  Suggestions  for  Future  Research 

During work on this dissertation, several extensions or other areas
of interest requiring further research have become obvious to this author.
They include the following questions.

1)   Could  the  on-line  processing  technique  employed  here  be 
extended  to  other  functions,  such  as  logarithmic, 
trigonometric,  and  exponential?   It  is  believed  that 
an  alternative  algorithmic  approach  to  processing,  such 
as  continued  products  as  described  by  DeLugish  [DEL70] 
or  the  E-method  as  developed  by  Ercegovac  [ERC75]  might 
be  more  appropriate  for  extension.   The  on-line  result 
digit  selection  procedure  outlined  in  this  thesis  could 
be  carried  over  with  slight  modification  to  these 
techniques.   This  area  is  in  need  of  further  research. 
2)   What about actual implementation? It is beyond the
influence  of  this  author  whether  or  not  an  actual 
AURORA  module  is  manufactured  as  a  single  chip.   It  is 
entirely  feasible  to  build  the  module  from  several 
standard  MSI  and  SSI  chips  now  available  on  the  market. 
An  even  more  attractive  alternative  would  be  to 
implement  AURORA  in  software  using  a  suitable  bit-slice 
microprocessor.   This  seems  to  be  more  useful  than 
simulating  AURORA  on  a  large  system  equipped  with  an 
appropriate  simulation  language.   The  microprocessor 
implementation  is  one  area  of  continuing  research  in 
which  this  author  hopes  to  be  actively  engaged. 
APPENDIX I
REDUNDANCY  DEFINITION 

Redundancy  (the  state  of  being  in  excess  of  what  is  necessary)  is 
used in the implementation of computer arithmetic to achieve three design
goals:   improved  reliability,  increased  speed,  and  structural  flexibility. 
To  achieve  the  first  goal,  hardware  redundancy  and/or  redundant  arithmetic 
codes  are  applied  to  the  detection  and  correction  of  faults.   This,  however, 
is  not  the  type  of  redundancy  referred  to  in  this  dissertation.   Rather, 
the  use  of  number  systems  employing  redundancy  in  their  representation  is 
implied.   In  this  way  the  design  goals  of  increased  speed  and  structural 
flexibility  can  be  achieved.   A  positional  number  system  with  fixed  radix, 
r,  is  redundant  if  the  allowable  digit  set  includes  more  than  r  distinct 
elements,  thereby  affording  alternate  representations  of  a  given  numeric 
value.   Uniqueness  of  representation  is  sacrificed  with  the  hope  of  gains 
in  speed  and  flexibility. 

The type of redundant number representation used throughout this
dissertation calls for the use of a symmetric redundant digit set, defined
as

$$D_\rho = \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, (\rho-1), \rho\}$$

where

$$\frac{r}{2} \le \rho \le r - 1 .$$
In particular, $D_\rho$ is

1)  minimally redundant if (where $|D_\rho|$ is the cardinality of the digit
set)

$$|D_\rho| = r + 1$$

so that

$$\rho = \frac{r}{2}$$

2)  or maximally redundant if

$$|D_\rho| = 2r - 1$$

so that

$$\rho = r - 1$$
Consequently, the representation of a number X is simply

$$X = \sum_{i=1}^{m} x_i r^{-i}$$

where  the  sign  of  X  is  just  the  sign  of  x..  . 
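EXAMPLE:

For radix r = 4

$$D = \{0, 1, 2, 3\}$$
$$D_{\rho\,\mathrm{MIN}} = \{\bar{2}, \bar{1}, 0, 1, 2\}$$
$$D_{\rho\,\mathrm{MAX}} = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$$

where the overbar denotes the negative sign, i.e., $\bar{2} = -2$. Then

$$X = 0.6875_{10} = 0.1011_2 = 0.23_4$$
$$\phantom{X} = 0.110\bar{1}_2 = 0.3\bar{1}_4 \quad \text{for } x_i \in D_{\rho\,\mathrm{MAX}}$$
$$\phantom{X} = 1.\bar{1}10\bar{1}_2 = 1.\bar{1}\bar{1}_4 \quad \text{for } x_i \in D_{\rho\,\mathrm{MIN}}$$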
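The radix-4 example can be verified mechanically. The following is a minimal sketch, assuming a digit vector is stored as a Python list of signed integers with the most significant fractional digit first; the helper names `digit_set` and `value`, and the `integer_digit` parameter used for the leading 1 in the minimally redundant form, are illustrative and not part of the thesis.

```python
from fractions import Fraction


def digit_set(radix, rho):
    """Symmetric digit set {-rho, ..., 0, ..., rho} with r/2 <= rho <= r-1."""
    if 2 * rho < radix or rho > radix - 1:
        raise ValueError("rho must satisfy r/2 <= rho <= r - 1")
    return list(range(-rho, rho + 1))


def value(digits, radix, integer_digit=0):
    """Exact value of integer_digit + sum of digits[i-1] * radix**(-i)."""
    total = Fraction(integer_digit)
    for i, d in enumerate(digits, start=1):
        total += Fraction(d, radix ** i)
    return total


print(digit_set(4, 2))                         # minimally redundant, r + 1 = 5 digits
print(digit_set(4, 3))                         # maximally redundant, 2r - 1 = 7 digits
print(value([2, 3], 4))                        # conventional 0.23 in radix 4
print(value([3, -1], 4))                       # maximally redundant form
print(value([-1, -1], 4, integer_digit=1))     # minimally redundant form
```

All three radix-4 digit vectors evaluate to 11/16 = 0.6875, which is the sense in which a redundant system affords alternate representations of one numeric value.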
Some of the desirable properties of this type of number representation
are:

1)  The representation of zero is unique. An algebraic value
of X = 0, if and only if all $x_i = 0$.

2)  The additive inverse (negation) of an operand is very simply
achieved by reversing the sign of every non-zero digit
individually.

3)  The sign of the algebraic value of X is given by the sign
of the most significant (leftmost) non-zero digit.
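Properties 2 and 3 translate directly into digit-wise operations. The sketch below assumes the same list-of-signed-integers encoding (most significant digit first); the function names are hypothetical. Property 3 relies on the bound $\rho \le r - 1$, which keeps the combined magnitude of all lower-order digits strictly below the weight of the leading non-zero digit.

```python
from fractions import Fraction


def value(digits, radix):
    """Exact value of sum of digits[i-1] * radix**(-i)."""
    return sum((Fraction(d, radix ** i) for i, d in enumerate(digits, start=1)),
               Fraction(0))


def negate(digits):
    """Property 2: negate by reversing the sign of every digit."""
    return [-d for d in digits]


def sign(digits):
    """Property 3: sign of the leftmost non-zero digit (0 if all digits are 0)."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0


x = [0, -1, 3]                       # radix 4, value -1/16 + 3/64 = -1/64
print(value(x, 4), sign(x))          # negative value, sign -1
print(value(negate(x), 4))           # additive inverse, +1/64
```

Neither operation needs carry propagation, which is part of what makes the representation attractive for most-significant-digit-first processing.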
APPENDIX II
SAMPLE  P-D  PLOTS  FOR  DIVISION 

This appendix contains sample P-D plots for on-line division.
The key in the top, left-hand corner of each plot corresponds to

BASE:    the radix ($r$)

RHO:     the maximum quotient digit ($\rho$)

DELTA:   the on-line delay ($\delta$)

ALPHA:   sufficient precision of $rP_j$ for quotient
digit selection ($\alpha$)

BETA:    sufficient precision of $D_j$ for quotient
digit selection ($\beta$)

The comparison constants for the treads ($rP_j$) are given in the right-hand
column. The comparison constants for the risers ($D$) are given along the
top row. In the case of base 16 several anomalies are apparent: 1) when
an overlap region required more than 50 steps, those steps were not
plotted; and 2) the comparison constants are not shown due to insufficient
space.


Table A.1  Example of the Algorithm DIVIDE
(r = 2, K = 1)

$r = 2$, $\delta = 5$, $m = 8$, $D_\rho = \{\bar{1}, 0, 1\}$, $K = 1$

N = 0.10100011
D = 0.11110110          (R = 0.10101001101...)

To ensure convergence, shift N right one bit.

 j   D_j          2P_j                                                 q_{j+1}   R_j 2^{-4}
     0.11110      0.10100                                                 1      0.0
 1   0.111101     2(0.10100 - 0.111101) = -0.10101                     \bar{1}   0.00001
 2   0.1111011    2(-0.1011 + 0.1111011 - 0.00001) = 0.100111             1      0.000001
 3   0.11110110   2(0.101001 - 0.1111011) = -0.101001                  \bar{1}
 4   0.11110110   2(-0.101001 + 0.1111011) = 0.101001                     1
 5   0.11110110   2(0.101001 - 0.1111011) = -0.101001                  \bar{1}
 6   0.11110110   2(-0.101001 + 0.1111011) = 0.101001                     1
 7   0.11110110   2(0.101001 - 0.1111011) = -0.101001                  \bar{1}
 8   0.11110110   2(-0.101001 + 0.1111011) = 0.101001                     1

$$R = 2\left(\sum_{i=1}^{8} q_i 2^{-i}\right) = 2(0.1\bar{1}1\bar{1}1\bar{1}1\bar{1}1\bar{1}) = 2(0.0101010101) = 0.10101010$$
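As a cross-check on the reference quotient R = 0.10101001101... quoted with Table A.1, the operands N = 0.10100011 and D = 0.11110110 can be divided conventionally. The sketch below is ordinary bit-serial long division on integers; it is neither redundant nor on-line and is not the DIVIDE algorithm of the thesis. The function name is hypothetical.

```python
def quotient_bits(n_bits, d_bits, count):
    """First `count` fraction bits of (0.n_bits) / (0.d_bits), assuming N < D."""
    if len(n_bits) != len(d_bits):
        raise ValueError("operands must have the same number of fraction bits")
    n = int(n_bits, 2)
    d = int(d_bits, 2)
    if not 0 <= n < d:
        raise ValueError("requires 0 <= N < D so that the quotient is a fraction")
    bits = []
    remainder = n
    for _ in range(count):
        remainder *= 2                # shift the partial remainder left one bit
        if remainder >= d:
            bits.append("1")
            remainder -= d
        else:
            bits.append("0")
    return "".join(bits)


print("0." + quotient_bits("10100011", "11110110", 11))  # 0.10101001101
```

The eight-digit result 0.10101010 obtained in the table differs from this reference value by less than $2^{-8}$, as would be expected of a quotient truncated to m = 8 digits.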
APPENDIX III
AN  ON-LINE  RECODING  ALGORITHM 

The  result  digits  generated  by  the  algorithms  discussed  in  this 
dissertation  are  in  a  redundant  format.   Before  they  can  be  output  from 
the  unit  they  must  be  converted  to  a  conventional  format.   As  each 
redundant  result  digit  becomes  available  from  the  result  digit  selector, 
it  is  stored  in  the  full  width  double  bank  result  register.   Given  the 
result in the form

$$R = \sum_{i=1}^{m} s_i r^{-i}$$

where

$$s_i \in \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, (\rho-1), \rho\}$$

it must be recoded to the form

$$R' = \sum_{i=1}^{m} s'_i r^{-i}$$

where

$$s'_i \in \{0, 1, 2, \ldots, r-1\}$$

The "mostly" on-line recoding algorithm is shown in Figure A.1. Note
that the only information needed to recode the present digit is the overall
sign of the result ($S_{op}$) and the sign of the next rightmost non-zero
digit ($S_n$). So, the only time the recoding network fails to produce an
output on-line is when it encounters a string of zeros in the result.
$S_{op}$:  sign of the overall result

$S_i$:  sign of the present result digit

$S_n$:  sign of the next rightmost nonzero result digit

$x_i$:  present result digit

$r$:  radix

Figure A.1  The On-line Recoding Algorithm
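The input and output formats of the recoding step can be illustrated with a conventional conversion. The sketch below is not the algorithm of Figure A.1: it is an ordinary right-to-left borrow propagation applied after all digits are known, so it is not on-line. It assumes a sign-and-magnitude output (the overall sign plus digits in {0, ..., r-1}), which is an assumption about the intended output format; the function name is hypothetical.

```python
def recode(digits, radix):
    """Convert a signed-digit fraction to (sign, conventional digits).

    digits -- signed digits, most significant first, each |d| <= radix - 1
    Returns the overall sign (-1, 0, or +1) and the magnitude digits,
    each in the range 0 .. radix - 1.
    """
    if any(abs(d) > radix - 1 for d in digits):
        raise ValueError("digit magnitude exceeds radix - 1")
    # Overall sign is the sign of the leftmost non-zero digit.
    sign = 0
    for d in digits:
        if d != 0:
            sign = 1 if d > 0 else -1
            break
    if sign < 0:
        digits = [-d for d in digits]     # work on the magnitude
    out = []
    borrow = 0
    for d in reversed(digits):            # least significant digit first
        d -= borrow
        if d < 0:
            d += radix                    # borrow one unit from the next digit
            borrow = 1
        else:
            borrow = 0
        out.append(d)
    out.reverse()
    return sign, out


print(recode([3, -1], 4))        # (1, [2, 3])
print(recode([1, 1, 0, -1], 2))  # (1, [1, 0, 1, 1])
print(recode([-1, 1], 2))        # (-1, [0, 1])
```

In this conventional form a borrow can ripple across a run of zero digits, which corresponds to the case noted above in which the recoding network must wait: a string of zeros cannot be resolved until the sign of the next non-zero digit is known.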